Collecting data from Zillow with R

Collecting data from Zillow with R

My mom has been house hunting over the past couple of weeks, so I decided to try and use R to look at the local market. Here’s what I’ve learned:

Collecting data from Zillow was pretty easy, overall. I mostly used R packages rvest, xlm2, and tidyr.


Next, I went to Zillow and searched for homes in Denver, CO. I zoomed in on an area that I wanted to analyze and then copied the link and pulled the data in R:


The next part gets pretty complicated to explain. You essentially have to find the information you want from the webpage,which looks like a bunch of scrambled text. It’s helpful to go back to the webpage, right click, and select “View Page Source.” This will help you identify the structure of the webpage and pull the data you want. I started by parsing out the housing links from the metadata. You’ll have to remove characters to parse out the data, which I show below:

houses<- webpage %>%
  html_nodes(".zsg-pagination a") %>%

houses <-strsplit(houses,"/")
houses<-lapply(houses, function(x) x[length(x)])
houses <-max(houses)

Then I used Jonkatz2 parser function to strip the data down even further. The rest of his functions didn’t work for me =/

getZillow <- function(urls) {
   lapply(urls, function(u) {
   cat(u, '\n')
   houses <- read_html(u) %>%
              html_nodes("article") houses })
zdata<- getZillow(urls)

Instead, I ended breaking down different parts of his function to get the data that I need. The reason I had to write all of this complicated syntax is because the data is saved in a list within lists.

#to pull ID
getID<- function(data) {
  lapply(1:length(ldata), function(x) {
    ids<-num %>% html_attr("id")

#get latitude
getLAT<- function(data) {
  lapply(1:length(ldata), function(x) {
    lat<-num %>% html_attr("data-latitude")

#get longitude
getLONG<- function(data) {
  lapply(1:length(ldata), function(x) {
    long<-num %>% html_attr("data-longitude")

#get price
getPrice<- function(data) {
  lapply(1:length(ldata), function(x) {
    price<-num %>%  
      html_node(".zsg-photo-card-price") %>%

#house description
getHdesc<- function(data) {
  lapply(1:length(ldata), function(x) {
    Hdesc<-num %>%  
      html_node(".zsg-photo-card-info") %>%
      html_text() %>%

#needs to be stripped down further

getBeds<- function(data) {
  for(i in 1:length(data)) {
     beds[[i]]<- t1 %>%

getBath<- function(data) {
  for(i in 1:length(data)) {
    baths[[i]]<- t1 %>%

getSQft<- function(data) {
  for(i in 1:length(data)) {
    sqft[[i]]<- t1 %>%

#house type
getHtype<- function(data) {
    lapply(1:length(ldata), function(x) {
      Htype<-num %>%  
        html_node(".zsg-photo-card-spec") %>%

getAddy<-function(data) {
  ldata<- 1:length(data)
  lapply(1:length(ldata),function(x) {
    addy<- num %>%
      html_nodes(".zsg-photo-card-address") %>%
    html_text() %>%


#listing type
getLtype<-function(data) {
  lapply(1:length(ldata), function(x) {
    ltype<-num %>% html_attr("data-pgapt")

Now you can unlist one level:

address<-lapply(address, function(x) unlist(x))
htype<-lapply(htype, function(x) unlist(x))
id<-lapply(id, function(x) unlist(x))
lats<-lapply(lats,function(x) unlist(x))
longs<-lapply(longs,function(x) unlist(x))
list_type<-lapply(list_type,function(x) unlist(x))
price<-lapply(price,function(x) unlist(x))

Then, I put it all in a data frame:

list<-list(id, price, address, beds, baths, sqft, list_type,longs, lats, htype)
makeList<-function(data) {
  lapply(1:length(ldata), function(x) {
    ll<-num %>% unlist(recursive=FALSE) 
df<-data.frame(id=c(List[[1]]), price=c(List[[2]]), address=c(List[[3]]),
               beds=c(List[[4]]), baths=c(List[[5]]), sqft=c(List[[6]]),
               l_type=c(List[[7]]), long=c(List[[8]]), lat=c(List[[9]]),

Some of these variables are not correctly formatted. For example, latitude and longitude values were stripped of their decimal points, so I need to add them back in by first removing the factor formatting and then doing some division.

df$long <-as.numeric(as.character(df$long)) / 1000000
df$lat<-as.numeric(as.character(df$lat)) / 1000000

Also, some of my other variables have characters in them, so I want to remove that too:

df$beds <-as.numeric(gsub("[^0-9]", "",df$beds, = TRUE))
df$baths <-as.numeric(gsub("[^0-9]", "",df$baths, = TRUE))
df$sqft <-as.numeric(gsub("[^0-9]", "",df$sqft, = TRUE))
df$price <-as.numeric(gsub("[^0-9]", "",df$price, = TRUE))
#replace NAs with 0

Now I can map my data, in addition to conducting any analyses that I may want to do. Since there’s a ton of stuff out there on conducting analyses in R, I’ll just show you how I mapped my data using the leaflet package:

m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat, popup=df$id) 

It should look like this:


If you click on the markers, they will show you the house IDs that they are associated with. You can see the web version by going to my OSF account, where I also posted the R program that I used.


Collecting Twitter Data using the twitteR package in Rstudio

Collecting Twitter Data using the twitteR package in Rstudio

Last week, I wrote a blog post about collecting data using Tweepy in Python. Like usual, I decided to recreate my work in R, so that I can compare my experience using different analytical tools. I will walk you through what I did, but I assume that you already have Rstudio installed. If not, and you wish to follow along, here’s a link to a good resource that explains how to download and install Rstudio.

Begin by loading the following libraries–download them if you don’t have them already installed.

#To download:
#install.packages(c("twitteR", "purrr", "dplyr", "stringr"),dependencies=TRUE)


Next, initiate the OAuth protocol. This of course assumes that you have registered your Twitter app. If not, here’s a link that explains how to do this.

api_key <- "your_consumer_api_key"
api_secret <-"your_consumer_api_secret"
token <- "your_access_token"
token_secret <- "your_access_secret"

setup_twitter_oauth(api_key, api_secret, token, token_secret)

Now you can use the package twitteR to collect the information that you want. For example, #rstats or #rladies <–great hashtags to follow on Twitter, btw 😉

tw = searchTwitter('#rladies + #rstats', n = 20)

which will return a list of (20) tweets that contain the two search terms that I specified:


*If you want more than 20 tweets, simply increase the number following n=

Alternatively, you can collect data on a specific user. For example, I am going to collect tweets from this awesome R-Lady, @Lego_RLady:

Again, using the twitteR package, type the following:

LegoRLady <- getUser("LEGO_RLady") #for info on the user
RLady_tweets<-userTimeline("LEGO_RLady",n=30,retryOnRateLimit=120) #to get tweets
tweets.df<-twListToDF(RLady_tweets) #turn into data frame
write.csv(tweets.df, "Rlady_tweets.csv", row.names = FALSE) #export to Excel

Luckily, she only has 27 tweets total. If you are collecting tweets from a user that has been on Twitter for longer, you’ll likely have to use a loop to continue collecting every tweet because of the rate limit. If you export to Excel, you should see something like this:

*Note: I bolded the column names and created the border to help distinguish the data

If you’re interested in the retweets and replies to @LEGO_RLady, then you can search for that specifically. To limit the amount of data, let’s limit it to any replies since the following tweet:

atRLady <- searchTwitter("@LEGO_RLady", 
                       sinceID=target_tweet, n=25, retryOnRateLimit = 20)

The atRLady.df data frame should look like this:


There’s much more data if you scroll right. You should have 16 variables total.

Sometimes there are characters in the tweet that result in errors. To make sure that the tweet is in plain text, you can do the following:

replies <- unlist(atRLady) #make sure to use the list and not the data frame

#helper function to remove characters:
clean_tweets <- function (tweet_list) {
  lapply(tweet_list, function (x) {
    x <- x$getText() # get text alone
    x <- gsub("&amp", "", x) # rm ampersands
    x <- gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*) ?", "", x) # rm links
    x <- gsub("#\\w+", "", x) # rm hashtags
    x <- gsub("@\\w+", "", x) # rm usernames
    x <- iconv(x, "latin1", "ASCII", sub="") # rm emojis
    x <- gsub("[[:punct:]]", "", x) # rm punctuation
    x <- gsub("[[:digit:]]", "", x) # rm numbers
    x <- gsub("[ \t]{2}", " ", x) # rm tabs
    x <- gsub("\\s+", " ", x) # rm extra spaces
    x <- trimws(x) # rm leading and trailing white space
    x <- tolower(x) # convert to lower case
tweets_clean <- unlist(clean_tweets(replies))
# If you want to rebombine the text with the metadata (user, time, favorites, retweets)
tweet_data <- data.frame(text=tweets_clean)
tweet_data <- tweet_data[tweet_data$text != "",]
tweet_data$user <-atRLady.df$screenName
tweet_data$time <- atRLady.df$created
tweet_data$favorites <- atRLady.df$favoriteCount
tweet_data$retweets <- atRLady.df$retweetCount
tweet_data$time_bin <- cut.POSIXt(tweet_data$time, breaks="3 hours", labels = FALSE)
tweet_data$isRetweet <- atRLady.df$isRetweet

You can pull other information from the original data frame as well, but I don’t find that information very helpful since it is usually NA (e.g., latitude and longitude). The final data frame should look like this:


Now you can analyze it. For example, you can graph retweets for each reply

ggplot(data = tweet_data, aes(x = retweets)) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Retweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")


If you have more data, you can conduct a sentiment analysis of all the words in the text of the tweets or create a wordcloud (example below)


Overall, using R to collect data from Twitter was really easy. Honestly, it was pretty easy to do in Python too. However, I must say that the R community is slightly better when it comes to sharing resources and blogs that make it easy for beginners to follow what they’ve done. I really love the open source community and I’m excited that I am apart of this movement!

PS- I forgot to announce that I am officially an R-Lady (see directory)! To all my fellow lady friends, I encourage you to join!



Mining Data from Twitter (and replies to Tweets) with Tweepy

Mining Data from Twitter (and replies to Tweets) with Tweepy

I recently met someone who is interested in mining data from Twitter. In addition to mining data from Twitter however, they’re also interested in collecting all of the replies. I thought that I would try giving it a shot and sharing what I learn.

Note: This post assumes that Python is installed on your computer. If you haven’t installed Python, this Python Wiki walks you through the process.

To scrape tweets from Twitter, I recommend using Tweepy, but there are several other options. To install tweepy:

pip install tweepy

*Note: If your environments are configured like mine, you may need to type: conda install -c conda-forge tweepy

Now, go to Twitter’s developer page to register your app (you will have to sign in with your username and password, or sign up with a new username). You should see a button on the right-hand side of the page that says “Create New App.” Fill out the necessary fields (i.e. name of the app; it’s description; your website) and then check the box that says you agree to their terms, which I linked to above. If you don’t have a publicly accessible website, just list the web address that is hosting your app (e.g., link to your school profile; link to your work website). You can likely ignore the Callback URL field, unless you are allowing users to log into your app to authenticate themselves. In which case, enter the URL where they would be returned after they’ve given permission to Twitter to use your app.

After registering your app, you should see a page where you can create your access token. Click the “Create my access token” button. If you don’t see this button after a few seconds, refresh the page. The next page will ask you what type of access you need. For this example, we will need Read, Write, and Access Direct Messages. Now, note your OAuth settings, particularly your Consumer Key, Consumer Secret, OAuth Access Token, and OAuth Access Token Secret. Don’t share this information with anyone!

Next, import tweepy and use the OAuth interface to collect data.

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth,  wait_on_rate_limit=True)

To collect tweets that are currently being shared on Twitter. For example, tweets that are under “StarwarsDay”. Note that I’ve asked Python to format the tweets by first listing the the user and screen name, and then their tweet. If you want ALL the meta data, remove the specifications that I wrote.

#you'll need to import json to run this script
import json
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], tweet['text'].encode('ascii', 'ignore')))

    def on_error(self, status):

if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "StarWarsDay"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['StarWarsDay'], async=True)

Here’s a brief glimpse of what I got:

I will now print Tweets containing "StarWarsDay"! ==>
@Amanda33441401: b'RT @givepennyuk: This #StarWarsDay we want you to feel inspired by the force to think up a Star Wars fundraiser!\n\n  Star Wars Movie Marat'
@jin_keapjjang: b'RT @HamillHimself: People around the world are marking #StarWarsDay in spectacular style. #MayThe4thBeWithYou via @'
@Bradleyg1996G: b'RT @NHLonNBCSports: MAY THE PORGS BE WITH YOU\n\n#StarWarsDay #MayThe4thBeWithYou'
@thays_jeronimo: b"RT @g1: 'May the 4th'  celebrado por fs de 'Star Wars' #MayThe4thBeWithYou #StarWarsDay #G1"
@DF_SomersetKY: b"If you're a fan of the franchise, you're going to love all of this Star Wars gear for your car! Tweet us your favor"
@zakrhssn: b'RT @williamvercetti: #StarWarsDay'
@kymaticaa: b'RT @Electric_Forest: May The Forest be with you.  #ElectricForest #StarWarsDay #StarWars'
@hullodave: b'"Only Imperial Stormtroopers are this precise" How precise? Not very? But why? Science! #StarWarsDay'

To store the data you just collected, rather than printing it, you’ll have to add some extra code (see text in red)

import csv
import json
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], tweet['text'].encode('ascii', 'ignore')))

        with open('StarWarsDay.csv','a') as f:

    def on_error(self, status):

if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "StarWarsDay"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['StarWarsDay'], async=True)

Now, let’s see if it’s possible to get replies to a tweet. I checked these users (listed above) and none of them have any replies. So I decided to just search #StarWarsDay on Twitter instead. I immediately found the twitter handle for Arrested Development, which has replies (at the time of this writing 9) to their tweet that includes #StarWarsDay:

Look at the hyperlink: [“”%5D
You can see that the user we are interested in is @bluthquotes and that the id for this particular tweet is “992433028155654144”

To get tweets from just @bluthquotes, you would type

bluthquotes_tweets = api.user_timeline(screen_name = 'bluthquotes', count = 100)

for status in bluthquotes_tweets:

To see all the replies to @bluthquotes tweet posted above:

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)  

for full_tweets in tweepy.Cursor(api.user_timeline,screen_name='bluthquotes',timeout=999999).items(10):
  for tweet in tweepy.Cursor(,q='to:bluthquotes', since_id=992433028155654144, result_type='recent',timeout=999999).items(1000):
    if hasattr(tweet, 'in_reply_to_status_id_str'):
      if (tweet.in_reply_to_status_id_str==full_tweets.id_str):
  print("Tweet :",full_tweets.text.translate(non_bmp_map))
  for elements in replies:
       print("Replies :",elements)

*Note: change the .items(10) line to get more replies. Remember that Twitter limits you to 100 per hour (at least at the time of this writing).

This is what I got:

Tweet : @HulkHogan You ok hermano?
Tweet : Go see a Star War #MayTheForceBeWithYou #StarWarsDay
Replies : @bluthquotes @mlot11
Replies : @bluthquotes Oh my yes
Replies : @bluthquotes Star Wars needs more gay characters #jarjarXobama
Replies : @bluthquotes @TheSAPeacock May the 4th be with you!!!!
Replies : @bluthquotes @SaraAnneGill
Replies : @bluthquotes @kurkobains Por q xuxa la sacaron de #Netflix
Replies : @bluthquotes
Replies : @bluthquotes @hiagorecanello
Replies : @bluthquotes No it’s #CincoDeCuatro
Replies : @bluthquotes @auburnhays 😂
Replies : @bluthquotes Go see a Star War on Cinco de Quatro! 🤠🌶️🍹
Replies : @bluthquotes @jmdroberts
Tweet : RT @arresteddev: Hey, hermanos! It's Cinco de Cuatro! Season 4 Remix is now streaming.
Tweet : Keep fighting little guy #StarWarsDay  #MayThe4thBeWithYou
Replies : @bluthquotes "worth every penny"
Replies : @bluthquotes You’re still doin’ that?
Replies : @bluthquotes Im crying 😭
Replies : @bluthquotes @theJdog 😂😂😂😂😂
Tweet : @JTHM8008 @herooine @JeffEisenband @ZachAJacobson I say HUZZAH! like this at least 5 times a week.
Tweet : @herooine @JeffEisenband @ZachAJacobson Checks out ✅
Tweet : @gjb512 It’s there already. Huzzah!
Tweet : @drkatiemd_ @MitchHurwitz It’s a wonderful program!
Tweet : I prematurely blue myself  #EmbarrassmentIn4Words
Tweet : @sebastrivi @VICE @arresteddev I’m not on board

You can see that it’s not quite what I wanted, which is just the responses to the Star Wars tweet (see red text). According to the API reference page, there should be a way to limit the returned text to the replies to the specific tweet we are interested, but I will have to continue tinkering with it. I’ll post an update when I figure it out.

US Fertility Heat Map DIY

US Fertility Heat Map DIY

The US fertility heat maps that I made a couple of weeks ago received a lot of attention and one of the questions I’ve been asked is how I produced it, which I describe in this post.

As I mentioned in my previous post, I simply followed the directions specified in this article, but I limited the UN data to the US. Overall, I think the article does a good job of explaining how they created their heat map in Tableau. The reason why I remade the heat map in R is because I was just frustrated with the process of trying to embed the visualization into WordPress. Both Tableau and WordPress charge you to embed visualizations in a format that is aesthetically pleasing. Luckily, recreating the heat map in R was extremely easy and just as pretty, at least in my opinion. Here’s how I did it:

First, download the data from the UN website–limit the data to the US only. Alternatively, I’ve linked to the (formatted) data on my OSF account, which also provides access to my code.

Now type the following in Rstudio:

#load libraries:
#if you need to install first, type: install.packages("package_name",dependencies=TRUE)

#set your working directory to the folder your data is stored in
setwd("C:/Users/Stella/Documents/blog/US birth Map")
#if you don't know what directory is currently set to, type: getwd()

#now import your data
us_fertility<-read.csv("USBirthscsv.csv", header=TRUE) #change the file name if you did not use the data I provided (

#limit to relevant data
dta% select(Year, January:December)

#gather (i.e., "aggregate") data of interest, in preparation for graphing

#orderring the data by most frequent incidence of births
dta %>%
group_by(Year) %>%

#plot the data
plot<- ggplot(bb2, aes(x =fct_rev(Month),
y = Year,
fill=rank)) +
scale_x_discrete(name="Months", labels=c("Jan", "Feb", "Mar",
"Apr", "May","Jun",
"Jul", "Aug", "Sep",
"Oct", "Nov", "Dec")) +
scale_fill_viridis(name = "Births", option="magma") + #optional command to change the colors of the heat map
geom_tile(colour = "White", size = 0.4) +
labs(title = "Heat Map of US Births",
subtitle = "Frequency of Births from 1969-2014",
x = "Month",
y = "Year",
caption = "Source: UN Data") +

plot+ aes(x=fct_inorder(Month))

#if you want to save the graph
dev.copy(png, "births.png")

And that’s it! Simple, right?!

Advice on Serving as a Conference Discussant

Advice on Serving as a Conference Discussant

My advisor invited me to serve as the conference discussant for her session on education and health at the annual conferences of the Population Association of America, and I decided share what I did to prepare and my thoughts about the experience. This post is mostly geared toward junior-level discussants (i.e., graduate students or new professors). The approach will be different because, although no one likely wants to offend the presenters, as an early-career researcher, we are potentially in the most precarious position if we do happen to offend anyone with our comments.


The first thing you should do is check if the conference provides some guidelines for discussants. The PAA does provide guidelines. To briefly summarize here, the PAA states that the primarily role of the discussant is to  provide the audience with perspective and insight about the substance and significance of the papers, particularly by highlighting similar themes and emphasizing each paper’s individual contribution to the literature. They suggest reading the papers multiple times in order to note the weaknesses and strengths of the paper, while critically assessing the paper’s assumptions, methods, and the conclusions. Do not focus on too many specifics of the paper because the audience has not necessarily read the entire paper. The overall idea is to be constructive. Offer comments and suggestions that help the author improve their study, while also encouraging the audience to engage in a thoughtful discussion about the paper.

In addition to reading the suggested guidelines for discussants, conduct a search on the internet to see if you can find anything from previous discussants. The blogs that I found mostly gave generic suggestions. For example, “say more with less,” meaning don’t talk too long. Let the audience do most of the talking. I did not find this sort of advice helpful. The best advice I found was from Duck of Minerva. Although this post was written in 2007, I think he does a great job of offering more specific advice about what discussants should say. For example, he suggests that “sometimes [being a good discussant]… involves setting the papers in a broader disciplinary context so as to invite other parts of the discipline into the discussion.” He goes on to suggest however, that “sometimes it just involves going to town against a paper and demolishing its absurdity.” But he recognizes that this is not something a graduate student should do. I agree. Although, I also question if it would matter if a graduate student did this. First, we often lack the experience and insight to even see such shortfalls–therefore doing this would only make us look stupid. Second, this is not how you want to be received in the research community, even if you were brilliant enough to find huge deficits in their papers. Basically, don’t be an uppity early-researcher!

It may also be useful to find videos to view live examples of what others have done (e.g., example 1example 2). I’ll admit that finding videos is surprisingly much more difficult. You will likely have to watch a lot of bad videos before you find one that will be relevant to your role. I was mainly looking for things like the seriousness of the discussant’s tone, how much time they spent summarizing versus critiquing/evaluating, and if they engaged with authors directly after they made comments or if they just left it up to the author to address the comments at the end. Some discussants also opened with discussing the broader state of the field.

My discussion:

After a couple day of conducting Google searches, I decided to open with some facts about the broader state of the field, emphasizing the importance of the research topic (Two sentences). I then briefly summarized the main purpose of each study and their main findings. I left it up to the authors to decide if they wanted to address my comments and suggestions, after I finished discussing their papers in the order that they presented–meaning I didn’t ask them direct questions. I mostly said something to the effect of, “it may be interesting to examine [enter specific topic here] in the future, given that the literature indicates [enter trend of finding here].” If I thought that something was not very clear, I said something like “the study could offer some more details on [the thing].” For example, one of the papers that I read did not offer much detail on their sample–like how they were sampled. Another author was not as clear as they could have been about the way they categorized a certain measure. I limited myself to pointing out two weaknesses and also tried to make two suggestions for future extensions of the study.

Also, at the suggestion of my advisor (who was the chair of the session), I also made a presentation. This is uncommon I think, but I didn’t want to go against the advice of my advisor. Because I included a presentation, I decided to have some fun with it and include animated gifs as my slides (see example below). I like to do this because I feel like it keeps the attention off of me.



Overall, my research paid off and my discussion went well. If I had to do it again though, I wouldn’t have walked into the discussion with a two sentence introduction on the state of the field. I thought that it would a nice way to acknowledge the importance of the topic–i.e., their life’s research–but the audience didn’t seem very receptive to it. I think they already know how important their field is. I also wish I focused more on offering suggestions or interpretations for unexpected findings. I hesitated to do this because this is not my research area. Audience members, however, were happy to do this, regardless of whether or not this is their field.

Understanding Linear Regression

My husband and I were discussing the intuition behind OLS regression today and I decided to share the materials that I generated to break down covariance, correlations, and the linear regression equation. It may help to follow along in the Excel workbook that I did this in (see link).

First, let’s say that we have the following data:

Y X1 X2
1 40 25
2 45 20
1 38 30
3 50 30
2 48 28
3 55 30
3 53 34
4 55 36
4 58 32
3 40 34
5 55 38
3 48 28
3 45 30
2 55 36
4 60 34
5 60 38
5 60 42
5 65 38
4 50 34
3 58 38

We plot the relationship between each X variable and Y, to get a visual look at their relationships:



I also like to look at the relationship in a 3D plot (this plot was made in Rstudio with plotly–see link for tutorial):


To calculate the mean of these data, we would have to sum each column and divide each column by the total number of observations (n=20):

Sum 65 1038 655
N 20 20 20
Mean 3.25 51.9 32.75

Now we need the standard deviation. This is something most students do not find very intuitive unless you force them to calculate this step by step, either by hand or in Excel. You should start by subtracting each observation in each column by it’s respective mean:

(Y-MeanY) (X1-MeanX1) (X2-MeanX2)
1-3.25=-2.25 40-51.9=-11.9 25-32.75=-7.75
2-3.25=-1.25 45-51.9=-6.9 20-32.75=-12.75
1-3.25=-2.25 38-51.9=-13.9 30-32.75=-2.75
3-3.25=-0.25 50-51.9=-1.9 30-32.75=-2.75
2-3.25=-1.25 48-51.9=-3.9 28-32.75=-4.75
3-3.25=-0.25 55-51.9=3.1 30-32.75=-2.75
3-3.25=-0.25 53-51.9=1.1 34-32.75=1.25
4-3.25=0.75 55-51.9=3.1 36-32.75=3.25
4-3.25=0.75 58-51.9=6.1 32-32.75=-0.75
3-3.25=-0.25 40-51.9=-11.9 34-32.75=1.25
5-3.25=1.75 55-51.9=3.1 38-32.75=5.25
3-3.25=-0.25 48-51.9=-3.9 28-32.75=-4.75
3-3.25=-0.25 45-51.9=-6.9 30-32.75=-2.75
2-3.25=-1.25 55-51.9=3.1 36-32.75=3.25
4-3.25=0.75 60-51.9=8.1 34-32.75=1.25
5-3.25=1.75 60-51.9=8.1 38-32.75=5.25
5-3.25=1.75 60-51.9=8.1 42-32.75=9.25
5-3.25=1.75 65-51.9=13.1 38-32.75=5.25
4-3.25=0.75 50-51.9=-1.9 34-32.75=1.25
3-3.25=-0.25 58-51.9=6.1 38-32.75=5.25

Then you will square each value in each column:

(Y-MeanY)^2 (X1-MeanX1)^2 (X2-MeanX2)^2
-2.25^2=5.0625 -11.9^2=141.61 -7.75^2=60.0625
-1.25^2=1.5625 -6.9^2=47.61 -12.75^2=162.5625
-2.25^2=5.0625 -13.9^2=193.21 -2.75^2=7.5625
-0.25^2=0.0625 -1.9^2=3.60999 -2.75^2=7.5625
-1.25^2=1.5625 -3.9^2=15.21 -4.75^2=22.5625
-0.25^2=0.0625 3.1^2=9.610000 -2.75^2=7.5625
-0.25^2=0.0625 1.1^2=1.21 1.25^2=1.5625
0.75^2=0.5625 3.1^2=9.610000 3.25^2=10.5625
0.75^2=0.5625 6.1^2=37.21 -0.75^2=0.5625
-0.25^2=0.0625 -11.9^2=141.61 1.25^2=1.5625
1.75^2=3.0625 3.1^2=9.610000 5.25^2=27.5625
-0.25^2=0.0625 -3.9^2=15.21 -4.75^2=22.5625
-0.25^2=0.0625 -6.9^2=47.61 -2.75^2=7.5625
-1.25^2=1.5625 3.1^2=9.610000 3.25^2=10.5625
0.75^2=0.5625 8.1^2=65.61 1.25^2=1.5625
1.75^2=3.0625 8.1^2=65.61 5.25^2=27.5625
1.75^2=3.0625 8.1^2=65.61 9.25^2=85.5625
1.75^2=3.0625 13.1^2=171.61 5.25^2=27.5625
0.75^2=0.5625 -1.9^2=3.60999 1.25^2=1.5625
-0.25^2=0.0625 6.1^2=37.21 5.25^2=27.5625

If you sum up each column of the squared values, you will get the standard deviation:

SD 29.75 1091.8 521.75

In addition to calculating the mean and standard deviation of Y, X1, and X2, you will also need to calculate the relationships between Y, X1, and X2 by first, multiplying them together, and then repeating each of the steps that we did above:

X1*Y X2*Y X1*X2
1*40=40 1*25=25 25*40=1000
2*45=90 2*20=40 20*45=900
1*38=38 1*30=30 30*38=1140
3*50=150 3*30=90 30*50=1500
2*48=96 2*28=56 28*48=1344
3*55=165 3*30=90 30*55=1650
3*53=159 3*34=102 34*53=1802
4*55=220 4*36=144 36*55=1980
4*58=232 4*32=128 32*58=1856
3*40=120 3*34=102 34*40=1360
5*55=275 5*38=190 38*55=2090
3*48=144 3*28=84 28*48=1344
3*45=135 3*30=90 30*45=1350
2*55=110 2*36=72 36*55=1980
4*60=240 4*34=136 34*60=2040
5*60=300 5*38=190 38*60=2280
5*60=300 5*42=210 42*60=2520
5*65=325 5*38=190 38*65=2470
4*50=200 4*34=136 34*50=1700
3*58=174 3*38=114 38*58=2204

Again, (1) sum each column and (2) divide by the total number of observations (n=20) to get the mean.

Sum 3513 2219 34510
N 20 20 20
Mean 175.65 110.95 1725.5

(3) In a separate table, subtract the respective mean for each column from each row value:

(X1*Y)-(MeanX1*Y) (X2*Y)-(MeanX2*Y) (X1*X2)-(MeanX1*X2)
40-175.65=-135.65 25-110.95=-85.95 1000-1725.5=-725.5
90-175.65=-85.65 40-110.95=-70.95 900-1725.5=-825.5
38-175.65=-137.65 30-110.95=-80.95 1140-1725.5=-585.5
150-175.65=-25.65 90-110.95=-20.95 1500-1725.5=-225.5
96-175.65=-79.65 56-110.95=-54.95 1344-1725.5=-381.5
165-175.65=-10.65 90-110.95=-20.95 1650-1725.5=-75.5
159-175.65=-16.65 102-110.95=-8.95 1802-1725.5=76.5
220-175.65=44.35 144-110.95=33.05 1980-1725.5=254.5
232-175.65=56.35 128-110.95=17.05 1856-1725.5=130.5
120-175.65=-55.65 102-110.95=-8.95 1360-1725.5=-365.5
275-175.65=99.35 190-110.95=79.05 2090-1725.5=364.5
144-175.65=-31.65 84-110.95=-26.95 1344-1725.5=-381.5
135-175.65=-40.65 90-110.95=-20.95 1350-1725.5=-375.5
110-175.65=-65.65 72-110.95=-38.95 1980-1725.5=254.5
240-175.65=64.35 136-110.95=25.05 2040-1725.5=314.5
300-175.65=124.35 190-110.95=79.05 2280-1725.5=554.5
300-175.65=124.35 210-110.95=99.05 2520-1725.5=794.5
325-175.65=149.35 190-110.95=79.05 2470-1725.5=744.5
200-175.65=24.35 136-110.95=25.05 1700-1725.5=-25.5
174-175.65=-1.650000 114-110.95=3.05 2204-1725.5=478.5

(4) Square those values for the standard deviation:

(X1*Y)-(MeanX1*Y)^2 (X2*Y)-(MeanX2*Y)^2 (X1*X2)-(MeanX1*X2)^2
-135.65^2=18400.9225 -85.95^2=7387.4025 -725.5^2=526350.25
-85.65^2=7335.9225 -70.95^2=5033.9025 -825.5^2=681450.25
-137.65^2=18947.5225 -80.95^2=6552.9025 -585.5^2=342810.25
-25.65^2=657.9225 -20.95^2=438.9025 -225.5^2=50850.25
-79.65^2=6344.1225 -54.95^2=3019.5025 -381.5^2=145542.25
-10.65^2=113.4225 -20.95^2=438.9025 -75.5^2=5700.25
-16.65^2=277.2225 -8.95^2=80.1025 76.5^2=5852.25
44.35^2=1966.9225 33.05^2=1092.3025 254.5^2=64770.25
56.35^2=3175.3225 17.05^2=290.7025 130.5^2=17030.25
-55.65^2=3096.9225 -8.95^2=80.1025 -365.5^2=133590.25
99.35^2=9870.4225 79.05^2=6248.9025 364.5^2=132860.25
-31.65^2=1001.7225 -26.95^2=726.3025 -381.5^2=145542.25
-40.65^2=1652.4225 -20.95^2=438.9025 -375.5^2=141000.25
-65.65^2=4309.9225 -38.95^2=1517.1025 254.5^2=64770.25
64.35^2=4140.9225 25.05^2=627.5025 314.5^2=98910.25
124.35^2=15462.9225 79.05^2=6248.9025 554.5^2=307470.25
124.35^2=15462.9225 99.05^2=9810.9025 794.5^2=631230.25
149.35^2=22305.4225 79.05^2=6248.9025 744.5^2=554280.25
24.35^2=592.9225 25.05^2=627.5025 -25.5^2=650.25

(5) Now sum up each column for the standard deviation:

SD 135118.6 56918.95 4279623

Comprehensively, you should get a table like this:

Y X1 X2 X1*Y X2*Y X1*X2
Sum 65 1038 655 3513 2219 34510
N 20 20 20 20 20 20
Mean 3.25 51.9 32.75 175.65 110.95 1725.5
SD 29.75 1091.8 521.75 135118.6 56918.95 4279623

Now we can derive the covariance between each variable, as well as the correlation, using these formulas:

Such that your table should look like this:

y X1 X2
Y 29.75 139.5 90.25
X1 0.77 1091.8 515.5
X2 0.72 0.68 521.75

Notice that the numbers in the diagonals (blue) are the standard deviations that we calculated. The numbers in the bottom triangle (underlined) represent the correlation, and the numbers in the top triangle (red) is the covariance. Below, I show you how I calculated each value in each cell (in Excel):

y X1 X2
Y 29.75 3513-(1038*65)/20=139.5 2219-(655*65)/20=90.25
X1 139.5/SQRT(1091.8*29.75)=0.77403 1091.8 515.5
X2 90.25/SQRT(521.75*29.75)=0.724 515.5/SQRT(1091.8*521.75)=0.683 521.75

Now, you can calculate the betas for X1 and X2 using these formulas:

Notice that you are standardizing each variable by accounting for its covariance with the other predictor. Plugging in each value, you should get the following:

b1 (521.75*139.5-515.5*90.25)/(1091.8*521.75-515.5^2)=0.086
b2 (*1091.8-521.75*515.5)/(0.683*-521.75^2)=0.088

Remember that the formula for OLS regression is simply:

So, using algebra, plug in the variables to calculate the constant:

a 3.25-(0.086*51.9)-(0.088*32.75)=-4.104

Now we have our regression equation:

Y’=4.104 – 0.086*X1 – 0.088*X2 

Now we can calculate our column for y-hat by plugging each X1 and X2 value into the equation. You should get a column that looks something like this:


Now you can plot the actual Y versus predicted Y (i.e., Y’):


There you have it! Hopefully this breakdown provides a better intuition of the numbers behind the OLS regression formula!

Gapminder gif with Rstudio

Gapminder gif with Rstudio

I decided to remake the Gapminder gif that I made the other day in Python, but in Rstudio this time. I’ll probably continue doing this for a while, as I try to figure out the advantages of using one program over the other. Here’s is a walk-through of what I did to recreate it:

#install these packages if you haven't already
install.packages(c("devtools", "dplyr", "ggplot2", "readr"))


#Set up ImageMagick --for gifs
install.packages("installr",dependencies = TRUE)

#Configure your environment--change the location
Sys.setenv(PATH = paste("C:/Program Files/ImageMagick-7.0.7-Q16", Sys.getenv("PATH"), sep = ";")) #change the path to where you installed ImageMagick
#Again, change the location:
magickPath <- shortPathName("C:/Program Files/ImageMagick-7.0.7-Q16/magick.exe")

If you need to download ImageMagick, go to this link

Load data and create plot

Once you’ve installed the appropriate packages and configured ImageMagick to work with Rstudio, you can load your data and plot as usual.

gapminder_data<-read.csv("", header=TRUE)

glimpse(gapminder_data) #print to make sure it loaded correctly
## Observations: 1,704
## Variables: 6
## $ country    Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ year       1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop        8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent  Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ lifeExp    28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ gdpPercap  779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Helper function for string wrapping. 
# Default 20 character target width.
swr = function(string, nwrap=40) {
  paste(strwrap(string, width=nwrap), collapse="\n")
swr = Vectorize(swr)

gapminder_plot<-ggplot(gapminder_data) +
  aes(x = gdpPercap,
      y = lifeExp,
      colour = continent,
      size = pop, 
      frame=year) +
      scale_x_log10() +
  scale_size_continuous(guide =FALSE) + #suppresses the second legend (size=pop)
  geom_point() +
  scale_color_viridis(discrete=TRUE)+ #optional way to change colors of the plot
  theme_bw() +
  labs(title=swr("Relationship Between Life Expectancy and GDP per Capita"),
       x= "GDP Per Capita",
       y= "Life expectancy",
      caption="Data: Gapminder")
  theme(legend.position = "none",
        plot.caption = element_text(size=.1))</

#getOption("device") #try running this if your plot doesn't immediately show gapminder_plot

#if you want to save the plot:
       plot = last_plot(), # or give ggplot object name as in myPlot,
       width = 5, height = 5, 
       units = "in", # other options c("in", "cm", "mm"), 
       dpi = 300)

Notice that I created the swr function to wrap the title text. If I don’t include that function, the title runs off the plot, like this:


Animate the plot

Now you can animate the plot using gganimate. Also, if you want to change any of the axis-titles or any other feature of the plot, I like to reference STHDA.

#remember to assign a working directory first:
#setwd() <--use this to change the working directory, if needed


All in all, I’d say that creating the gif was equally easy in Python and R. Although I had more trouble initally configuring Python with ImageMagic–I might have found it easier in R simply because I used Python to figure this out the first time.  On the other hand, I like the way the Python gif looks much more than the gif that Rstudio rendered.


Looks like I’ll have to continue experimenting.