2018 World Population Day!

Happy World Population Day!

In case you’re unfamiliar with World Population Day, it was established in 1989 by the Governing Council of the United Nations Development Programme. July 11th was chosen because July 11, 1987 was the approximate date on which the world’s population reached 5 billion people. The purpose of World Population Day is to draw attention to issues related to the global population, including the implications of population growth for the environment, economic development, gender equality, education, poverty, and human rights. Human rights is the theme this year, specifically family planning as a human right: 2018 marks the 50th anniversary of the 1968 International Conference on Human Rights, where family planning was for the first time globally affirmed to be a human right.

This year, the world population is estimated to be around 7,632,819,325 people. If you want to see estimates of the global population in real time, you can visit the Worldometers website, which will also show other interesting estimates of population-related statistics, such as healthcare expenditures, energy consumption, and water use. If you’re interested in where they get their data and their methods, you can visit their FAQ section.

World Population #Rstats Edition

In celebration of World Population Day, I thought I would share an R program that pulls data from the Worldometers site and creates a world map that highlights the top 10 countries with the largest total populations:

The top 10 countries with the largest total populations are highlighted in dark green.

R code below:

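#Load libraries
library(rvest)    #read_html, html_nodes, html_table
library(magrittr) #extract2
library(dplyr)    #recode, left_join, mutate
library(ggplot2)  #map_data (needs the maps package installed), ggplot
library(viridis)  #scale_fill_viridis

#Retrieve data:
html.global_pop <- read_html("http://www.worldometers.info/world-population/population-by-country/")

#Create dataframe
df.global_pop_RAW <- html.global_pop %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table()

#Check data
head(df.global_pop_RAW)

#Check for unnecessary spaces in values
str(df.global_pop_RAW)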
#Check if country names match those in the map package
as.factor(df.global_pop_RAW$`Country (or dependency)`) %>% levels()

#Renaming countries to match how they are named in the package
df.global_pop_RAW$`Country (or dependency)` <- recode(df.global_pop_RAW$`Country (or dependency)`
                                   ,'U.S.' = 'USA'
                                   ,'U.K.' = 'UK')

#Convert population to numeric--you have to remove the "," before converting 
df.global_pop_RAW$`Population (2018)`<-as.numeric(as.vector(unlist(gsub(",", "",df.global_pop_RAW$`Population (2018)` ))))
sapply(df.global_pop_RAW,class) #Check that it worked

#Generate a world map
world_map<- map_data('world')

#Join map data with our data
map.world_joined <- left_join(world_map, df.global_pop_RAW, 
                              by = c('region' = 'Country (or dependency)'))

#Take only top 10 countries
df.global_pop10 <- df.global_pop_RAW %>%
  arrange(desc(`Population (2018)`)) %>%
  slice(1:10)

#Printing to check
print(df.global_pop10)

#Change data to numeric
df.global_pop10$`Population (2018)`<-as.numeric(as.vector(unlist(gsub(",", "",df.global_pop10$`Population (2018)` ))))

#Check if worked correctly
sapply(df.global_pop10, class)

#Join map data to our data
map.world_joined2 <- left_join(world_map, df.global_pop10, 
                              by = c('region' = 'Country (or dependency)'))

#Create Flag to indicate that it will be colored in for the map
map.world_joined2 <- map.world_joined2 %>%
  mutate(tofill2 = ifelse(is.na(`#`), F, T))

#Now generate the map
ggplot() +
  geom_polygon(data = map.world_joined2, 
               aes(x = long, y = lat, group = group, fill = tofill2)) +
  scale_fill_manual(values = c("lightcyan2","darkturquoise")) +
  labs(title =  'Top 10 Countries with Largest populations (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 16)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "none")
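
For anyone following along in Python instead, the cleaning, top-n, join, and flag steps above can be sketched with pandas. This is only a minimal sketch: the rows are placeholder values rather than the scraped Worldometers table, the column names `country` and `population` are made up for the example, and it keeps the top 3 rows instead of the top 10 so the output stays short.

```python
import pandas as pd

# Toy rows standing in for the scraped table (placeholder values, not real data).
pop = pd.DataFrame({
    "country": ["China", "India", "U.S.", "U.K.", "Tuvalu"],
    "population": ["1,400", "1,300", "330", "66", "1"],
})

# Rename countries so they match the names used by the map data.
pop["country"] = pop["country"].replace({"U.S.": "USA", "U.K.": "UK"})

# Strip the thousands separators before converting to integers.
pop["population"] = pop["population"].str.replace(",", "", regex=False).astype(int)

# Keep the n largest populations (n = 3 here, 10 in the post).
top = pop.nlargest(3, "population")

# Toy stand-in for the map data: one row per polygon vertex.
world = pd.DataFrame({"region": ["China", "China", "USA", "UK", "France"]})

# A left join keeps every map row; the indicator column records whether
# a matching country was found in the top-n table.
joined = world.merge(top, how="left", left_on="region",
                     right_on="country", indicator=True)
joined["tofill"] = joined["_merge"] == "both"

print(joined[["region", "population", "tofill"]])
```

The `tofill` column plays the same role as the `tofill2` flag in the R code: map rows without a match in the top-n table get `False` and would be drawn in the background color.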

Alternatively, you could include all the countries and use a gradient to indicate population size. However, China’s and India’s populations are so large relative to other countries that it becomes difficult to see any real comparison.

#Generate map data (again)
world_map<- map_data('world')

#re-join with data
map.world_joined <- left_join(world_map, df.global_pop_RAW, 
                              by = c('region' = 'Country (or dependency)'))

#flag to fill ALL countries that match with the map package
map.world_joined <- map.world_joined %>%
  mutate(tofill = ifelse(is.na(`#`), F, T))

#Check that it worked correctly
table(map.world_joined$tofill)

#Then generate new map
#(color and size belong in geom_polygon; ggplot() silently ignores them)
ggplot(data = map.world_joined, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = `Population (2018)`), color = "white", size = .001) +
  scale_fill_viridis(option = 'magma') +
  labs(title =  'World Population by Country (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 18)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank())

Which should produce this map:
You can see that the other countries that made the top 10 list are not black (black marks the smallest populations), but this map really just highlights how large China’s and India’s populations are relative to the other countries.
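
One common workaround, which I did not use for the map above, is to color by the logarithm of population rather than the raw count. The sketch below only illustrates the arithmetic. The four population sizes are rounded orders of magnitude chosen for the example (not the scraped values), and `rescale` is a hypothetical helper that maps values onto the 0 to 1 range a color gradient works with.

```python
import math

# Rounded, illustrative population sizes (orders of magnitude only).
populations = {"A": 1_400_000_000, "B": 300_000_000, "C": 10_000_000, "D": 100_000}

def rescale(values, transform=lambda v: v):
    """Map values onto [0, 1] after applying an optional transform."""
    t = [transform(v) for v in values]
    lo, hi = min(t), max(t)
    return [(v - lo) / (hi - lo) for v in t]

linear = rescale(populations.values())             # what a plain gradient sees
logged = rescale(populations.values(), math.log10) # what a log gradient sees

for name, lin, lg in zip(populations, linear, logged):
    print(f"{name}: linear={lin:.3f}  log10={lg:.3f}")
```

On the linear scale a country of 10 million sits almost at the bottom of the gradient next to one of 100 thousand, while on the log scale it lands near the middle, which is why mid-sized countries become distinguishable.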

More population data and viz

If you want to know more about the global population and how it has changed over time, here are some great resources:

Our World in Data— see also their estimates for future population growth

8 min PBS video

Hans Rosling TEDx video (10 min)

20 min Hans Rosling video –this uses the gapminder data I often code with in Python and in R

Kurzgesagt animated video (6.5 min)

If you’re interested in theories and analytical concepts of demography, here are some links to free online class material:

Johns Hopkins Demographic Methods –or here

Johns Hopkins Principles of Population Change




Gapminder gif with Rstudio

I decided to remake the Gapminder gif that I made the other day in Python, but in Rstudio this time. I’ll probably continue doing this for a while, as I try to figure out the advantages of using one program over the other. Here’s a walk-through of what I did to recreate it:

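#install these packages if you haven't already
install.packages(c("devtools", "dplyr", "ggplot2", "readr", "viridis"))

#load the packages used below
library(dplyr)   #glimpse
library(ggplot2) #plotting
library(viridis) #scale_color_viridis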


#Set up ImageMagick --for gifs
install.packages("installr",dependencies = TRUE)

#Configure your environment--change the location
Sys.setenv(PATH = paste("C:/Program Files/ImageMagick-7.0.7-Q16", Sys.getenv("PATH"), sep = ";")) #change the path to where you installed ImageMagick
#Again, change the location:
magickPath <- shortPathName("C:/Program Files/ImageMagick-7.0.7-Q16/magick.exe")

If you need to download ImageMagick, go to this link

Load data and create plot

Once you’ve installed the appropriate packages and configured ImageMagick to work with Rstudio, you can load your data and plot as usual.

gapminder_data<-read.csv("https://python-graph-gallery.com/wp-content/uploads/gapminderData.csv", header=TRUE)

glimpse(gapminder_data) #print to make sure it loaded correctly
## Observations: 1,704
## Variables: 6
## $ country    Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ year       1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop        8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent  Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ lifeExp    28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ gdpPercap  779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Helper function for string wrapping. 
# Default 40 character target width.
swr = function(string, nwrap=40) {
  paste(strwrap(string, width=nwrap), collapse="\n")
}
swr = Vectorize(swr)

gapminder_plot<-ggplot(gapminder_data) +
  aes(x = gdpPercap,
      y = lifeExp,
      colour = continent,
      size = pop, 
      frame=year) +
  scale_x_log10() +
  scale_size_continuous(guide = "none") + #suppresses the second legend (size=pop)
  geom_point() +
  scale_color_viridis(discrete=TRUE)+ #optional way to change colors of the plot
  theme_bw() +
  labs(title=swr("Relationship Between Life Expectancy and GDP per Capita"),
       x= "GDP Per Capita",
       y= "Life expectancy",
       caption="Data: Gapminder") +
  theme(legend.position = "none",
        plot.caption = element_text(size=.1))

gapminder_plot #print the plot
#getOption("device") #try running this if your plot doesn't immediately show

#if you want to save the plot:
ggsave("gapminder_plot.png", #example file name; change as needed
       plot = last_plot(), # or give ggplot object name as in myPlot,
       width = 5, height = 5, 
       units = "in", # other options c("in", "cm", "mm"), 
       dpi = 300)

Notice that I created the swr function to wrap the title text. If I don’t include that function, the title runs off the plot.
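
For comparison, the same wrapping idea can be sketched in Python with the standard library's `textwrap` module. `wrap_title` is a hypothetical name for this example, and the exact break points may differ slightly from what `strwrap` produces in R.

```python
import textwrap

def wrap_title(text: str, width: int = 40) -> str:
    """Break text at word boundaries into lines of at most `width` characters."""
    return "\n".join(textwrap.wrap(text, width=width))

title = "Relationship Between Life Expectancy and GDP per Capita"
print(wrap_title(title))
```

The returned string contains newline characters, so it can be passed straight to a plot title in the same way the output of swr is passed to labs().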


Animate the plot

Now you can animate the plot using gganimate. Also, if you want to change any of the axis-titles or any other feature of the plot, I like to reference STHDA.

#remember to assign a working directory first:
#setwd() <--use this to change the working directory, if needed


All in all, I’d say that creating the gif was equally easy in Python and R. I had more trouble initially configuring Python with ImageMagick, but I might have found it easier in R simply because I used Python to figure this out the first time. On the other hand, I like the way the Python gif looks much more than the gif that Rstudio rendered.


Looks like I’ll have to continue experimenting.

Data Visualization in Python

Sharing a visualization that I made with Python, in Jupyter Notebook.

First, import the following libraries:

# Set up libraries
%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

Then import data and make scatter plots for each year of life expectancy data, courtesy of Gapminder:

# Get Gapminder Life Expectancy data (csv file is hosted on the web)
url = 'https://python-graph-gallery.com/wp-content/uploads/gapminderData.csv'
data = pd.read_csv(url)

# Transform Continent into numerical values group1->1, group2->2...
data['continent'] = pd.Categorical(data['continent'])

# Resolution used to size each figure (assumed value; matches the dpi passed to savefig)
my_dpi = 96

# For each year:
for i in data.year.unique():
    # initialize a figure
    fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi)

    # Select this year's rows; color encodes continent, size encodes population
    tmp = data[data.year == i]
    plt.scatter(tmp['lifeExp'], tmp['gdpPercap'], s=tmp['pop']/200000, c=tmp['continent'].cat.codes, cmap="Accent", alpha=0.6, edgecolors="white", linewidth=2)

    # Add titles (main and on axis)
    plt.xlabel("Life Expectancy")
    plt.ylabel("GDP per Capita")
    plt.title("Year: " + str(i))
    plt.xlim(30, 90)

    # Save the results; the "Gapminder" prefix matches the Gapminder*.png pattern passed to ImageMagick
    filename = 'Gapminder' + str(i) + '.png'
    plt.savefig(filename, dpi=96)
    plt.close(fig)

Next, download and install ImageMagick. Then make the gif by typing the following in your (Windows 10) command prompt:

magick convert Gapminder*.png animated_gapminder.gif

If you have any issues configuring ImageMagick, like I did, you may find this link useful.
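
A quick way to see whether ImageMagick is configured is to check from Python whether the `magick` executable is on your PATH. The sketch below does that and also builds the argument list for the command above. `build_gif_command` is a hypothetical helper for this example, and the sketch only prints the command rather than running it.

```python
import glob
import shutil

def build_gif_command(pattern: str, output: str) -> list:
    """Return the ImageMagick argument list for combining frames into a gif."""
    frames = sorted(glob.glob(pattern))  # expand the wildcard ourselves, in name order
    return ["magick", "convert", *frames, output]

magick_path = shutil.which("magick")  # None if ImageMagick is not on PATH
if magick_path is None:
    print("ImageMagick not found on PATH; check the install location.")
else:
    print("Found ImageMagick at", magick_path)

cmd = build_gif_command("Gapminder*.png", "animated_gapminder.gif")
print(cmd)
```

Once the frames exist and `magick` is found, the list can be handed to `subprocess.run(cmd, check=True)` to produce the gif without leaving the notebook.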


Note: You can make a gif using Matplotlib or moviepy, but I couldn’t quite figure it out. I will update once I do.
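
In the meantime, here is a minimal sketch of what the Matplotlib-only route could look like, using `FuncAnimation` with the `PillowWriter` (which needs the Pillow package installed). The three frames are made-up placeholder numbers rather than the Gapminder data, and the output name `animated_sketch.gif` is arbitrary. I have not compared the result with the ImageMagick version.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Tiny placeholder frames: year -> (life expectancy values, GDP per capita values).
frames = {
    1952: ([40, 55, 68], [800, 3000, 12000]),
    1957: ([43, 58, 70], [900, 3500, 14000]),
    1962: ([46, 61, 71], [1000, 4200, 16000]),
}
years = list(frames)

fig, ax = plt.subplots(figsize=(5, 4))
ax.set_yscale("log")
ax.set_xlim(30, 90)
ax.set_ylim(500, 100000)
ax.set_xlabel("Life Expectancy")
ax.set_ylabel("GDP per Capita")
points = ax.scatter(*frames[years[0]])

def update(year):
    """Move the points to this year's positions and update the title."""
    x, y = frames[year]
    points.set_offsets(list(zip(x, y)))
    ax.set_title("Year: " + str(year))
    return (points,)

# One animation frame per year, written straight to a gif.
anim = FuncAnimation(fig, update, frames=years)
anim.save("animated_sketch.gif", writer=PillowWriter(fps=2))
plt.close(fig)
```

The design difference from the loop above is that a single figure is reused and only the point positions and title change per frame, so no intermediate PNG files are written.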

US Fertility Trends by Month and Year

I came across this beautiful heat map of live births by month and country/region, and I decided to recreate it for the US, but by year. The figure below shows the frequency of births from 1972 to 2014, with darker boxes indicating more births. The data are from the UN. (PS: Tableau and WordPress are annoying if you don’t want to pay for extras. I can’t embed my table without the play sign in front of it. AND it has a broken link *sigh*. So, please see twitter link ^_^).

I was somewhat surprised by how consistent the timing of births by month has stayed over the past 40 years. I expected more variability starting around the 1980s, as marriage rates declined and nonmarital fertility increased. However, the most common birth months in the US have consistently been July through October, suggesting most babies are conceived in late fall and early winter. Looks like holidays are good for baby-making.
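
I built the figure in Tableau, but the underlying reshaping step is the same in any tool: turn one row per year and month into a year-by-month grid that the heat map colors in. A minimal pandas sketch of that step follows. The counts are made-up placeholders (not the UN data), and the column names `year`, `month`, and `births` are assumptions for the example.

```python
import pandas as pd

# Placeholder long-format records: one row per year and month (made-up counts).
births = pd.DataFrame({
    "year":   [1972, 1972, 1972, 1973, 1973, 1973],
    "month":  [7, 8, 9, 7, 8, 9],
    "births": [100, 110, 105, 98, 112, 101],
})

# One row per year, one column per month: the grid a heat map colors in.
grid = births.pivot(index="year", columns="month", values="births")
print(grid)

# Month with the most births in each year.
peak_month = grid.idxmax(axis=1)
print(peak_month)
```

A grid like this can be passed to a plotting function such as `seaborn.heatmap(grid)` to draw the boxes, with darker cells for larger counts depending on the color map.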

My thoughts on the 2020 Census

The question of citizenship should not be included on the census.

The Census Bureau increasingly faces unprecedented challenges in accurately enumerating the US population. Some of these challenges include monetary costs, unequal access to the internet, cyber security threats, and decreasing trust in our government institutions. Adding the question of citizenship will likely increase these challenges by sowing further distrust among immigrant communities, thereby making it even more difficult for the Census Bureau to accurately count our population. Although demographers and policymakers are unsure of the full extent to which this will affect American communities, we know for certain that this will indeed have repercussions for those who need government services the most.

So let’s break these claims down. The first claim is that adding a question of citizenship will likely reduce the accuracy of the Census. (If you want some background on the citizenship question and how it has been asked in the past, see this document and, if you have access, this study.) For starters, Black and Hispanic groups (and some Native American groups) are already more likely to be undercounted relative to other race-ethnic groups (Census 2012). This is because racial and ethnic minority groups disproportionately live in hard-to-count circumstances (see quote from former Census director in the 2012 report). These “hard-to-count” circumstances include high residential mobility, housing irregularity, limited literacy, motives for concealment (such as undocumented status or violation of housing codes), and fears of outsiders (see ASA Census Report for more).

Beyond the fact that these groups are difficult to count, during field tests preparing for the 2020 Census, field staff reported that respondents expressed fears about the confidentiality of their data (for example, whether ICE will see it) and an increasing perception that immigrant groups are unwelcome (Census NAC 2017 meeting). This behavior has not been observed on this scale in previous counts, and it resulted in falsified information and increased nonresponse. Although these were field tests with smaller samples of the population, these preliminary results suggest that including this question will likely affect the accuracy of the Census data.

The question in itself is also a problem. The Census Bureau has not had the opportunity to rigorously test the effects of including a question of citizenship status on a larger sample of the population, nor has it tested how respondents will react to the question as currently worded. Moreover, there is absolutely no reason to risk the accuracy of the Census by adding this question: the Constitution does not explicitly state that only citizens shall be counted. For these reasons, adding the question of citizenship to the Census, especially this late, is unnecessary, irresponsible, and potentially harmful.

I’m not alone in my opinion. Over 160 mayors have issued a letter to Commerce Secretary Wilbur Ross, stating that the Census should not include a question about citizenship. The former director of the Census has also expressed his skepticism toward Wilbur Ross’s conclusion that including the question would not do any harm (CityLab 2018). Congressional representatives and officials have also denounced the move (e.g., Rep. Meng of NY; CA Attorney General Xavier Becerra; Sens. Harris of CA, Carper of DE, Peters of MI, and McCaskill of MO). Their reasoning is that the move seems politically motivated, given the racial resentment expressed by the current administration, and that it will hurt American communities.

Here are a few ways that this may hurt states and localities. The George Washington Institute of Public Policy reported that approximately 300 federal programs allocate over $800 billion a year based on Census counts. The same report examined the effects of an undercount on five programs administered by the Department of Health and Human Services, which account for nearly half of all federal grants to states, and found that 37 states lost a median of $1,091 for each person missed in the 2010 Census. Pew and the Georgetown University Law Center on Poverty and Inequality show which counties will likely be affected. I’ve also included a chart displaying the Census Bureau’s estimates of US immigrants by jurisdiction below. The bars display the proportion of the population who are immigrants, while the color of the bars indicates the absolute number of immigrants relative to other jurisdictions. The data only include jurisdictions with populations of 100,000 or more.

Census Bureau publishes population estimates for non-U.S. citizens

Notice that it’s not just large cities that will be affected. Places like Waterbury, Connecticut and Gresham, Oregon just barely make the list of jurisdictions with populations of 100,000 or more, yet more than 10% of their population is made up of immigrants.

So you may be thinking: so what? Why should they be entitled to services paid for by our hard-earned tax money? To that I say, this federal funding benefits your community as well as the individuals who receive it. For example, medical services that are paid for in part by federal medical assistance programs, such as Medicaid and CHIP, help maintain the infrastructure and cover the costs of local hospitals and clinics. When there is a decrease in revenue, medical services often leave the community. Additionally, some of this funding goes toward child care, which helps parents keep jobs that allow them to take care of their families while also paying back into the tax system. Also consider that many cities and municipalities base zoning laws on census counts. If you own or rent a home, this affects how much you pay for housing. So even if you are a utilitarian or a self-interested pragmatist, I urge you to reconsider your position on the importance of the Census.