Media & Article Roundup 5/20/18

^image: from https://twitter.com/graykimbrough/status/997172979002986496?utm_source=citylab-daily&silverid=MzUzODY1MjU3MjgxS0

Other

Free Ivy-League MOOCS

The Galton Board

The Rewards and Challenges of Writing for a Mass Media Audience

A new way to trade, using the Ethereum blockchain

Big data limitations

The legacy of Fahrenheit 451

A song about induced demand

Demography

Fertility is still on the decline – this article provides a thorough explanation of these trends

Sociology

The Most Intolerant Wins: The Dictatorship of the Small Minority (Chapter from Skin in the Game)–by Nassim Nicholas Taleb

Gender

Women are less likely than men to write to editors; and “In many cases, the confidence men have is not particularly warranted.”

“A survey of a random sample of members of the AEA… found that hardly any men believed professional opportunities for economics faculty are tilted against women. Remarkably, about a third believe there is bias in favour of women.”

Women in Tech

Health

Trends in Health Care Spending and US Life Expectancy since the 1980s: a look at why the US spends more on health care but sees smaller improvements in health

The public debate about gun policies often ignores the group most likely to die by guns: suicidal men.

City Health Dashboard

Environment

Hurricane season is coming, and we are not ready – not exactly the best time to make cuts to climate research

Bitcoin’s energy footprint has more than doubled over the past 6 months

Maps

The US Border Zone

Puerto Rican migration after Hurricane Maria

Homeownership on the decline

Inequality

Trends in graduate student loan debt – much of the difference by race is unexplained

Education is not the most important factor that drives mobility–it’s job opportunities and marriage. But education predicts both job opportunities and marriage.

AI is sexist and racist because we are

Politics

Stereotypes about voter fraud are closely associated with public support for strict punishment of voter fraud violations

Affordable housing may be the next big political issue

Collecting data from Zillow with R

My mom has been house hunting over the past couple of weeks, so I decided to try and use R to look at the local market. Here’s what I’ve learned:

Collecting data from Zillow was pretty easy, overall. I mostly used the R packages rvest, xml2, and tidyr.

library(rvest)
library(tidyr)
library(xml2)

Next, I went to Zillow and searched for homes in Denver, CO. I zoomed in on an area that I liked and then copied the link and pulled the data in R:

url<-"https://www.zillow.com//homes//for_sale//Denver-CO_rb//?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy"
webpage<-read_html(url)

The next part gets pretty complicated to explain. You essentially have to find the information you want from the webpage, which looks like a bunch of scrambled text. It’s helpful to go back to the webpage, right click, and select “View Page Source.” This will help you identify the structure of the webpage and pull the data you want. I started by parsing out the housing links from the metadata. You’ll have to remove characters to parse out the data, which I show below:

#pull the pagination links to work out how many pages of results there are
houses <- webpage %>%
  html_nodes(".zsg-pagination a") %>%
  html_attr("href")

houses <- houses[!is.na(houses)]                    #drop entries without an href
houses <- strsplit(houses, "/")                     #split each link into its parts
houses <- lapply(houses, function(x) x[length(x)])  #keep the last part (e.g., "2_p")
houses <- as.numeric(gsub('[_p]', '', houses))      #strip "_p" to leave the page number
houses <- max(houses)                               #highest page number
urls <- c(url, paste0(url, 2:houses, '_p/'))        #build the full list of page URLs
urls

Then I used Jonkatz2’s parser function to strip the data down even further. The rest of his functions didn’t work for me =/

getZillow <- function(urls) {
  lapply(urls, function(u) {
    cat(u, '\n')                  #show progress
    houses <- read_html(u) %>%
      html_nodes("article")       #each listing is an <article> node
    houses
  })
}
zdata<- getZillow(urls)

Instead, I ended up breaking down different parts of his function to get the data that I needed. The reason I had to write all of this complicated syntax is that the data is saved in lists within lists.

#to pull ID
getID<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ids<-num %>% html_attr("id")
  })
}
id<-getID(zdata)

#get latitude
getLAT<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    lat<-num %>% html_attr("data-latitude")
    
  })
}
lats<-getLAT(zdata)

#get longitude
getLONG<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    long<-num %>% html_attr("data-longitude")
    
  })
}
longs<-getLONG(zdata)

#get price
getPrice<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    price<-num %>%  
      html_node(".zsg-photo-card-price") %>%
      html_text() 
  })
}
price<-getPrice(zdata)

#house description
getHdesc<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    Hdesc<-num %>%  
      html_node(".zsg-photo-card-info") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}
hdesc<-getHdesc(zdata)

#the descriptions need to be stripped down further
hdesc[[1]][[1]]               #peek at the first listing's description on page 1
ldata2<-length(hdesc[[1]])    #number of listings on the first page

beds<-list()
getBeds<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
     beds[[i]]<- t1 %>%
       purrr::map_chr(1)
  }
  return(beds)
}
beds<-getBeds(hdesc)

baths<-list()
getBath<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    baths[[i]]<- t1 %>%
      purrr::map_chr(2)
  }
  return(baths)
}
baths<-getBath(hdesc)

sqft<-list()
getSQft<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    sqft[[i]]<- t1 %>%
      purrr::map_chr(3)
  }
  return(sqft)
}
sqft<-getSQft(hdesc)

#house type
getHtype<- function(data) {
    ldata<-1:length(data)
    lapply(1:length(ldata), function(x) {
      num<-data[[x]]
      Htype<-num %>%  
        html_node(".zsg-photo-card-spec") %>%
        html_text()
  })
}
htype<-getHtype(zdata)

#address
getAddy<-function(data) {
  ldata<- 1:length(data)
  lapply(1:length(ldata),function(x) {
    num<-data[[x]]
    addy<- num %>%
      html_nodes(".zsg-photo-card-address") %>%
    html_text() %>%
      strsplit("\u00b7")
  })
}

address<-getAddy(zdata)

#listing type
getLtype<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ltype<-num %>% html_attr("data-pgapt")
    
  })
}
list_type<-getLtype(zdata)
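As an aside, most of the helper functions above differ only in the attribute or CSS selector they pull from each page of results, so you could optionally collapse them into two generic functions. This is just a sketch of that idea (getAttr and getNodeText are my names, not from the original post):

#generic helpers: one for html attributes, one for text inside a CSS selector
getAttr <- function(data, attr) {
  lapply(data, function(page) html_attr(page, attr))
}
getNodeText <- function(data, selector) {
  lapply(data, function(page) page %>% html_node(selector) %>% html_text())
}

#example usage (should reproduce the lists built above):
#lats  <- getAttr(zdata, "data-latitude")
#price <- getNodeText(zdata, ".zsg-photo-card-price")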

Now you can unlist one level:

address<-lapply(address, function(x) unlist(x))
htype<-lapply(htype, function(x) unlist(x))
id<-lapply(id, function(x) unlist(x))
lats<-lapply(lats,function(x) unlist(x))
longs<-lapply(longs,function(x) unlist(x))
list_type<-lapply(list_type,function(x) unlist(x))
price<-lapply(price,function(x) unlist(x))

Then, I put it all in a data frame:

df<-data.frame()
list<-list(id, price, address, beds, baths, sqft, list_type,longs, lats, htype)
makeList<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ll<-num %>% unlist(recursive=FALSE) 
  })
}
List<-makeList(list)
df<-data.frame(id=c(List[[1]]), price=c(List[[2]]), address=c(List[[3]]),
               beds=c(List[[4]]), baths=c(List[[5]]), sqft=c(List[[6]]),
               l_type=c(List[[7]]), long=c(List[[8]]), lat=c(List[[9]]),
               h_type=c(List[[10]]))

Some of these variables are not correctly formatted. For example, latitude and longitude values were stripped of their decimal points, so I need to add them back in by first removing the factor formatting and then doing some division.

df$long <-as.numeric(as.character(df$long)) / 1000000
df$lat<-as.numeric(as.character(df$lat)) / 1000000

Also, some of my other variables have characters in them, so I want to remove that too:

df$beds <-as.numeric(gsub("[^0-9]", "",df$beds, ignore.case = TRUE))
df$baths <-as.numeric(gsub("[^0-9]", "",df$baths, ignore.case = TRUE))
df$sqft <-as.numeric(gsub("[^0-9]", "",df$sqft, ignore.case = TRUE))
df$price <-as.numeric(gsub("[^0-9]", "",df$price, ignore.case = TRUE))
#replace NAs with 0
df[is.na(df)]<-0

Now I can map my data, in addition to conducting any analyses that I may want to do. Since there’s a ton of stuff out there on conducting analyses in R, I’ll just show you how I mapped my data using the leaflet package:

library(leaflet)
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat, popup=df$id) 
m

The map should look like this:

[Image: leaflet map of the Denver listings]

If you click on the markers, they will show you the house IDs that they are associated with. You can see the web version by going to my OSF account, where I also posted the R program that I used.
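If you want the popups to show more than the house ID, you could paste several of the data frame’s columns into the popup text. Here’s a small sketch of that idea (this addition is mine, not part of the original code):

m2 <- leaflet(df) %>%
  addTiles() %>%
  addMarkers(lng = ~long, lat = ~lat,
             popup = ~paste0("ID: ", id, "<br>",
                             "Price: $", price, "<br>",
                             beds, " bd / ", baths, " ba / ", sqft, " sqft"))
m2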

Media & Article Roundup 5/13/18

^Image is from a Pew Report on incarceration (link below)

Data Viz & Design

Map of worst D.C. intersections for pedestrians –it’s from 2017, but resurfacing due to this report on rising pedestrian fatalities. This topic was also covered in CityLab

The average commute to work is getting longer

Vintage Road Atlases

Science, Tech & Society

An essay arguing that Google is implicitly training children to accept surveillance

Music in the age of Spotify

The rent is too damn high, because of Airbnb!

Is Facebook a media company?

Are we data engineering/enabling violence?

Increasing number of retractions of scientific journal articles (*They should have used rate of retractions rather than showing total number of retractions, given the rising number of journals over the same period)

Great breakdown of Holocaust beliefs vs. media coverage

The rise of News Deserts in the US

Demography & Development

U.S. life expectancy varies by more than 20 years from county to county

The Success Sequence – and What It Leaves Out

Africa’s path to development through its service sector

The US still has the highest incarceration rates, relative to the rest of the world

Gender

The redistribution of sex is not a thing.

“If we can work together to stop sexual harassment in the workplace … women will only have to deal with harassment all the time at every other place they go.”

Race & Politics

Tolerance is not a moral absolute; it is a peace treaty

“Reductive seduction” is not malicious, but it can be reckless

A Video Explaining the Hidden Meanings Behind Childish Gambino’s ‘This Is America’

Education

What is the “true” high school graduation rate?

Economy

Labor movements around the world

Visualizing the middle class

Other

Amazing illustrations by Glenn Harvey

Collecting Twitter Data using the twitteR package in Rstudio

Last week, I wrote a blog post about collecting data using Tweepy in Python. As usual, I decided to recreate my work in R so that I can compare my experience using different analytical tools. I will walk you through what I did, but I assume that you already have Rstudio installed. If not, and you wish to follow along, here’s a link to a good resource that explains how to download and install Rstudio.

Begin by loading the following libraries–download them if you don’t have them already installed.

#To download:
#install.packages(c("twitteR", "purrr", "dplyr", "stringr"),dependencies=TRUE)

library(twitteR)
library(purrr)
suppressMessages(library(dplyr))
library(stringr)

Next, initiate the OAuth protocol. This of course assumes that you have registered your Twitter app. If not, here’s a link that explains how to do this.


api_key <- "your_consumer_api_key"
api_secret <-"your_consumer_api_secret"
token <- "your_access_token"
token_secret <- "your_access_secret"

setup_twitter_oauth(api_key, api_secret, token, token_secret)

Now you can use the package twitteR to collect the information that you want. For example, #rstats or #rladies – great hashtags to follow on Twitter, btw 😉

tw = searchTwitter('#rladies + #rstats', n = 20)

which will return a list of (20) tweets that contain the two search terms that I specified:

[Image: tweets returned by the #rladies + #rstats search]

*If you want more than 20 tweets, simply increase the number following n=

Alternatively, you can collect data on a specific user. For example, I am going to collect tweets from this awesome R-Lady, @Lego_RLady:

Again, using the twitteR package, type the following:


LegoRLady <- getUser("LEGO_RLady") #for info on the user
RLady_tweets<-userTimeline("LEGO_RLady",n=30,retryOnRateLimit=120) #to get tweets
tweets.df<-twListToDF(RLady_tweets) #turn into data frame
write.csv(tweets.df, "Rlady_tweets.csv", row.names = FALSE) #export to a CSV file (opens in Excel)

Luckily, she only has 27 tweets total. If you are collecting tweets from a user that has been on Twitter for longer, you’ll likely have to use a loop to continue collecting every tweet because of the rate limit. If you export to Excel, you should see something like this:
[Image: the exported tweets viewed in Excel]

*Note: I bolded the column names and created the border to help distinguish the data
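As mentioned above, a user with a longer timeline requires looping because each request only returns a limited batch. Here’s a rough sketch of how that paging could look, assuming userTimeline()’s maxID argument works as documented (the boundary tweet is returned again on each pass, so duplicates are dropped at the end):

all_tweets <- list()
max_id <- NULL
prev_id <- ""
repeat {
  batch <- userTimeline("LEGO_RLady", n = 200, maxID = max_id, retryOnRateLimit = 120)
  if (length(batch) == 0) break
  all_tweets <- c(all_tweets, batch)
  max_id <- batch[[length(batch)]]$id    #oldest tweet returned so far
  if (identical(max_id, prev_id)) break  #nothing older left
  prev_id <- max_id
}
all_tweets.df <- unique(twListToDF(all_tweets)) #drop the duplicated boundary tweets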

If you’re interested in the retweets and replies to @LEGO_RLady, then you can search for that specifically. To limit the amount of data, let’s limit it to any replies since the following tweet:

target_tweet<-"991771358634889222"
atRLady <- searchTwitter("@LEGO_RLady", 
                       sinceID=target_tweet, n=25, retryOnRateLimit = 20)
atRLady.df<-twListToDF(atRLady)

The atRLady.df data frame should look like this:

[Image: the atRLady.df data frame]

There’s much more data if you scroll right. You should have 16 variables total.

Sometimes there are characters in the tweet that result in errors. To make sure that the tweet is in plain text, you can do the following:

replies <- unlist(atRLady) #make sure to use the list and not the data frame

#helper function to remove characters:
clean_tweets <- function (tweet_list) {
  lapply(tweet_list, function (x) {
    x <- x$getText() # get text alone
    x <- gsub("&amp", "", x) # rm ampersands
    x <- gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*) ?", "", x) # rm links
    x <- gsub("#\\w+", "", x) # rm hashtags
    x <- gsub("@\\w+", "", x) # rm usernames
    x <- iconv(x, "latin1", "ASCII", sub="") # rm emojis
    x <- gsub("[[:punct:]]", "", x) # rm punctuation
    x <- gsub("[[:digit:]]", "", x) # rm numbers
    x <- gsub("[ \t]{2}", " ", x) # rm tabs
    x <- gsub("\\s+", " ", x) # rm extra spaces
    x <- trimws(x) # rm leading and trailing white space
    x <- tolower(x) # convert to lower case
  })
}
tweets_clean <- unlist(clean_tweets(replies))
# If you want to recombine the text with the metadata (user, time, favorites, retweets)
tweet_data <- data.frame(text=tweets_clean)
tweet_data <- tweet_data[tweet_data$text != "", , drop=FALSE] #drop empty tweets but keep the data frame (and the "text" column name)
tweet_data$user <-atRLady.df$screenName
tweet_data$time <- atRLady.df$created
tweet_data$favorites <- atRLady.df$favoriteCount
tweet_data$retweets <- atRLady.df$retweetCount
tweet_data$time_bin <- cut.POSIXt(tweet_data$time, breaks="3 hours", labels = FALSE)
tweet_data$isRetweet <- atRLady.df$isRetweet

You can pull other information from the original data frame as well, but I don’t find that information very helpful since it is usually NA (e.g., latitude and longitude). The final data frame should look like this:

[Image: the cleaned tweet_data data frame]

Now you can analyze it. For example, you can graph the number of retweets for each reply:

library(ggplot2)
ggplot(data = tweet_data, aes(x = retweets)) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Retweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
dev.copy(png,'myplot.png')
dev.off()

[Image: bar chart of retweet counts]

If you have more data, you can conduct a sentiment analysis of all the words in the text of the tweets or create a wordcloud (example below)

[Image: example word cloud]
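For reference, a word cloud like the one above can be built from the cleaned text with the wordcloud package. A minimal sketch, assuming the tweet_data frame created earlier (the package choice is mine):

library(wordcloud)
library(RColorBrewer)

#count word frequencies in the cleaned tweet text
word_freq <- sort(table(unlist(strsplit(as.character(tweet_data$text), " "))), decreasing = TRUE)

#plot the cloud, dropping words that appear only once
wordcloud(names(word_freq), as.numeric(word_freq), min.freq = 2,
          colors = brewer.pal(8, "Dark2"))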

Overall, using R to collect data from Twitter was really easy. Honestly, it was pretty easy to do in Python too. However, I must say that the R community is slightly better when it comes to sharing resources and blogs that make it easy for beginners to follow what they’ve done. I really love the open source community and I’m excited that I am a part of this movement!

PS- I forgot to announce that I am officially an R-Lady (see directory)! To all my fellow lady friends, I encourage you to join!

 

 

Media & Article Roundup 5/6/18

^image is of hit-and-run fatalities, featured in a new AARP report (link below)

Demography

Census Bureau features data on the Asian-American and Pacific Islander population

A majority of residents support the right of same-sex couples to get married in 44 states – and the majority of those who oppose it are conservative Republicans

Small metros are growing

Interactive map of segregation in the US

Declining home ownership rates in the US

Gen Z is the loneliest generation

Interactive Map of Food Insecurity in the US

Rising youth unemployment levels around the world

Health

Affordable Housing and Population Health

Mapping the Opioid Crisis

Brace yourself. The insects are coming, and they’re bringing diseases with them.

New report from AARP shows that hit-and-run fatalities are concentrated in the south

Culture

How Ferrero Rocher emerged as a status symbol among immigrant families

Political views and privacy concerns

Visualizing the structure of a comedy routine –has audio

All American Nazis

Inequality

Who is in the working class?

The Matthew Effect in Science Funding

The economic returns to cultural capital

Gender

Women are fleeing the tech industry

Gender and Self-citation across Fields and over Time

Female freelancers are less likely to be paid on time than male freelancers

Decreasing the gender pay gap: How to answer questions about your salary at your current/former job

The gender gap in economics

Videos:

Envisioning higher-order dimensions

Defusing the population bomb

Visualizing the birthday paradox – not really a video, but interesting nonetheless. The paradox, however, assumes that birthdays are uniformly distributed when they’re not (see previous post)

Mining Data from Twitter (and replies to Tweets) with Tweepy

I recently met someone who is interested in mining data from Twitter. In addition to mining the tweets themselves, however, they’re also interested in collecting all of the replies. I thought I would give it a shot and share what I learn.

Note: This post assumes that Python is installed on your computer. If you haven’t installed Python, this Python Wiki walks you through the process.

To scrape tweets from Twitter, I recommend using Tweepy, but there are several other options. To install tweepy:

pip install tweepy

*Note: If your environments are configured like mine, you may need to type: conda install -c conda-forge tweepy

Now, go to Twitter’s developer page to register your app (you will have to sign in with your username and password, or sign up with a new username). You should see a button on the right-hand side of the page that says “Create New App.” Fill out the necessary fields (i.e., the name of the app, its description, and your website) and then check the box that says you agree to their terms, which I linked to above. If you don’t have a publicly accessible website, just list the web address that is hosting your app (e.g., a link to your school profile or your work website). You can likely ignore the Callback URL field, unless you are allowing users to log into your app to authenticate themselves, in which case enter the URL where they would be returned after they’ve given permission to Twitter to use your app.

After registering your app, you should see a page where you can create your access token. Click the “Create my access token” button. If you don’t see this button after a few seconds, refresh the page. The next page will ask you what type of access you need. For this example, we will need Read, Write, and Access Direct Messages. Now, note your OAuth settings, particularly your Consumer Key, Consumer Secret, OAuth Access Token, and OAuth Access Token Secret. Don’t share this information with anyone!

Next, import tweepy and use the OAuth interface to collect data.

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth,  wait_on_rate_limit=True)

To collect tweets that are currently being shared on Twitter – for example, tweets tagged “StarWarsDay” – you can set up a stream listener. Note that I’ve asked Python to format the tweets by first listing the user’s screen name and then the text of their tweet. If you want ALL the metadata, remove the specifications that I wrote.

#you'll need to import json to run this script
import json
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], tweet['text'].encode('ascii', 'ignore')))

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "StarWarsDay"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['StarWarsDay'], async=True)

Here’s a brief glimpse of what I got:

I will now print Tweets containing "StarWarsDay"! ==>
@Amanda33441401: b'RT @givepennyuk: This #StarWarsDay we want you to feel inspired by the force to think up a Star Wars fundraiser!\n\n  Star Wars Movie Marat'
@jin_keapjjang: b'RT @HamillHimself: People around the world are marking #StarWarsDay in spectacular style. #MayThe4thBeWithYou https://t.co/BM02D965Xa via @'
@Bradleyg1996G: b'RT @NHLonNBCSports: MAY THE PORGS BE WITH YOU\n\n#StarWarsDay #MayThe4thBeWithYou https://t.co/HWAmptYND5'
@thays_jeronimo: b"RT @g1: 'May the 4th'  celebrado por fs de 'Star Wars' https://t.co/ggNhaEQCPV #MayThe4thBeWithYou #StarWarsDay #G1 https://t.co/wUY74DZL"
@DF_SomersetKY: b"If you're a fan of the franchise, you're going to love all of this Star Wars gear for your car! Tweet us your favor https://t.co/PteAtqS1Ui"
@zakrhssn: b'RT @williamvercetti: #StarWarsDay https://t.co/fgHZzTZ0Fm'
@kymaticaa: b'RT @Electric_Forest: May The Forest be with you.  #ElectricForest #StarWarsDay #StarWars https://t.co/bfQnZHI8eX'
@hullodave: b'"Only Imperial Stormtroopers are this precise" How precise? Not very? But why? Science! #StarWarsDay https://t.co/niZ2h6ssnp'

To store the data you just collected, rather than printing it, you’ll have to add some extra code (the file-writing try/except block in the version below):

import csv
import json
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], tweet['text'].encode('ascii', 'ignore')))

        try:
            with open('StarWarsDay.csv', 'a') as f:
                f.write(data)
        except:
            pass

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "StarWarsDay"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['StarWarsDay'], async=True)

Now, let’s see if it’s possible to get replies to a tweet. I checked these users (listed above) and none of them have any replies, so I decided to just search #StarWarsDay on Twitter instead. I immediately found the Twitter handle for Arrested Development, which has replies (9 at the time of this writing) to their tweet that includes #StarWarsDay:

Look at the hyperlink: [“https://twitter.com/bluthquotes/status/992433028155654144”]
You can see that the user we are interested in is @bluthquotes and that the id for this particular tweet is “992433028155654144”

To get tweets from just @bluthquotes, you would type

bluthquotes_tweets = api.user_timeline(screen_name = 'bluthquotes', count = 100)

for status in bluthquotes_tweets:
    print(status)

To see all the replies to @bluthquotes tweet posted above:


import sys  #needed for sys.maxunicode below

replies = []
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

for full_tweets in tweepy.Cursor(api.user_timeline,screen_name='bluthquotes',timeout=999999).items(10):
  for tweet in tweepy.Cursor(api.search,q='to:bluthquotes', since_id=992433028155654144, result_type='recent',timeout=999999).items(1000):
    if hasattr(tweet, 'in_reply_to_status_id_str'):
      if (tweet.in_reply_to_status_id_str==full_tweets.id_str):
        replies.append(tweet.text)
  print("Tweet :",full_tweets.text.translate(non_bmp_map))
  for elements in replies:
       print("Replies :",elements)
  replies.clear()

*Note: change the .items(10) line to get more replies. Remember that Twitter limits you to 100 per hour (at least at the time of this writing).

This is what I got:

Tweet : @HulkHogan You ok hermano?
Tweet : Go see a Star War #MayTheForceBeWithYou #StarWarsDay https://t.co/OLcmCAEl30
Replies : @bluthquotes @mlot11
Replies : @bluthquotes Oh my yes
Replies : @bluthquotes Star Wars needs more gay characters #jarjarXobama
Replies : @bluthquotes @TheSAPeacock May the 4th be with you!!!!
Replies : @bluthquotes @SaraAnneGill
Replies : @bluthquotes @kurkobains Por q xuxa la sacaron de #Netflix
Replies : @bluthquotes  https://t.co/Qv4KJJ6dFU
Replies : @bluthquotes @hiagorecanello
Replies : @bluthquotes No it’s #CincoDeCuatro
Replies : @bluthquotes @auburnhays 😂
Replies : @bluthquotes Go see a Star War on Cinco de Quatro! 🤠🌶️🍹
Replies : @bluthquotes @jmdroberts
Tweet : RT @arresteddev: Hey, hermanos! It's Cinco de Cuatro! Season 4 Remix is now streaming. https://t.co/Alw0Z2Zwlm
Tweet : Keep fighting little guy #StarWarsDay  #MayThe4thBeWithYou https://t.co/Uim4D2BP49
Replies : @bluthquotes "worth every penny"
Replies : @bluthquotes You’re still doin’ that?
Replies : @bluthquotes Im crying 😭
Replies : @bluthquotes @theJdog 😂😂😂😂😂
Tweet : @JTHM8008 @herooine @JeffEisenband @ZachAJacobson I say HUZZAH! like this at least 5 times a week.
Tweet : @herooine @JeffEisenband @ZachAJacobson Checks out ✅
Tweet : @gjb512 It’s there already. Huzzah!
Tweet : @drkatiemd_ @MitchHurwitz It’s a wonderful program!
Tweet : I prematurely blue myself  #EmbarrassmentIn4Words https://t.co/QYUFeSKFT2
Tweet : @sebastrivi @VICE @arresteddev I’m not on board

You can see that it’s not quite what I wanted, which is just the replies to the Star Wars tweet. According to the API reference page, there should be a way to limit the returned text to the replies to the specific tweet we are interested in, but I will have to continue tinkering with it. I’ll post an update when I figure it out.

US Fertility Heat Map DIY

The US fertility heat maps that I made a couple of weeks ago received a lot of attention, and one of the questions I’ve been asked is how I produced them, which I describe in this post.

As I mentioned in my previous post, I simply followed the directions specified in this article, but I limited the UN data to the US. Overall, I think the article does a good job of explaining how they created their heat map in Tableau. The reason I remade the heat map in R is that I was frustrated with the process of trying to embed the visualization into WordPress. Both Tableau and WordPress charge you to embed visualizations in a format that is aesthetically pleasing. Luckily, recreating the heat map in R was extremely easy and the result is just as pretty, at least in my opinion. Here’s how I did it:

First, download the data from the UN website–limit the data to the US only. Alternatively, I’ve linked to the (formatted) data on my OSF account, which also provides access to my code.

Now type the following in Rstudio:


#load libraries:
#if you need to install first, type: install.packages("package_name",dependencies=TRUE)
library(tidyverse)
library(viridis)
library(ggthemes) #needed for theme_tufte() used below

#set your working directory to the folder your data is stored in
setwd("C:/Users/Stella/Documents/blog/US birth Map")
#if you don't know what directory is currently set to, type: getwd()

#now import your data
us_fertility<-read.csv("USBirthscsv.csv", header=TRUE) #change the file name if you did not use the data I provided (osf.io/h9ta2)

#limit to relevant data
dta <- us_fertility %>% select(Year, January:December)

#gather the monthly columns into long format (Month, births), in preparation for graphing
dta <- dta %>%
  gather(Month, births, January:December) %>%
  arrange(Year)

#rank the months within each year by frequency of births
bb2 <- dta %>%
  group_by(Year) %>%
  mutate(rank = dense_rank(desc(births)))

#plot the data
plot <- ggplot(bb2, aes(x = fct_rev(Month),
                        y = Year,
                        fill = rank)) +
  scale_x_discrete(name = "Months", labels = c("Jan", "Feb", "Mar",
                                               "Apr", "May", "Jun",
                                               "Jul", "Aug", "Sep",
                                               "Oct", "Nov", "Dec")) +
  scale_fill_viridis(name = "Births", option = "magma") + #optional: change the colors of the heat map
  geom_tile(colour = "White", size = 0.4) +
  labs(title = "Heat Map of US Births",
       subtitle = "Frequency of Births from 1969-2014",
       x = "Month",
       y = "Year",
       caption = "Source: UN Data") +
  theme_tufte()

plot + aes(x = fct_inorder(Month))

#if you want to save the graph
dev.copy(png, "births.png")
dev.off()
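As an aside, ggsave() from ggplot2 will save the most recently displayed plot without needing dev.copy()/dev.off(). A one-line alternative (my addition, not part of the original walkthrough):

#alternative: save the last plot that was displayed
ggsave("births.png", width = 6, height = 8, dpi = 300)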

 
And that’s it! Simple, right?!