This post is an addendum to the OLS tutorial I posted two weeks ago. I will walk through the explanations that I provided in the program and I will interpret the output. If you haven’t read the previous post, you can access the materials using this link.
The tutorial begins by telling the user to run the following two lines of code:
clear all
set more off, permanently
The “clear” command clears the data, along with any value labels, currently loaded in Stata. Adding “all” also clears matrices, scalars, constraints, clusters, stored results, sersets, and Mata functions and objects from memory; it additionally closes all open files and postfiles, clears the class system, closes any open Graph windows and dialog boxes, drops all programs from memory, and resets all timers to zero.
The next command, “set more off,” changes Stata’s default setting, which is “set more on.” The “set more off” command tells Stata not to pause or display the “--more--” message when showing results. Personally, I hate pressing the spacebar over and over again until all of my results are displayed. Adding the option “permanently” (options follow the “,”) makes “set more off” the new default, so once you run this command you do not have to run it again. You can change this back by typing “set more on, permanently”. You can read more about the clear command here and the set more off command here.
In the next step, the tutorial directs you to load your data:
use "[folder path data is in]\[datafile name]"
I stored my data in the folder “C:\Users\Stella\Documents\blog\ols”. This replaces the placeholder “[folder path data is in]”, brackets included. The file name is “nlsy97_2015”, which replaces the placeholder “[datafile name]”, again brackets included, so that the final command reads:
use "C:\Users\Stella\Documents\blog\ols\nlsy97_2015"
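You could add the option “, clear” to this command instead of running “clear all” first. For the purpose of loading the data it accomplishes the same thing, because it replaces whatever dataset is in memory; unlike “clear all”, it leaves matrices, programs, and the other items listed above untouched.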
Next, I walk through how to set up a log file. A log records your Stata session. You can run multiple logs at the same time if you wish. It’s nice to keep a log of your work so that you can track what syntax was used to generate your output. In the program, I store the log in the same folder as my data. I name my log SMnlsy97_2015.txt. I add the option “replace” to allow Stata to write over an existing log, if there is one. You generally want to do this because it is likely that you will make mistakes and have to re-run your program.
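A minimal sketch of the log commands described above. The folder and file name are the ones used in this post; the “text” option and the closing “log close” are my additions here, on the assumption that you want a plain-text log and want to close it at the end of the program:

```stata
* Start a log in the same folder as the data; replace overwrites an existing log.
* The text option (an assumption here) requests plain-text format rather than SMCL.
log using "C:\Users\Stella\Documents\blog\ols\SMnlsy97_2015.txt", replace text

* ... the rest of the program goes here ...

* Close the log at the end of the session.
log close
```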
In the next section of the program, I walk you through commands that allow you to look at your variables. For example, I look at the variable Wage, which is the outcome variable in this tutorial:
codebook Wage
This will produce the following output:
As you can see, codebook will show you information about the variable, such as the type of variable it is (numeric), the lowest value (0) and highest value (110,000), the number of unique values (400), the number of missing values (1400), the mean (40209), the standard deviation (27211.2), and the interquartile range. If you ran the command “codebook” without specifying a variable or variables, it would produce output for every variable in the dataset.
You can view a summary of the variable by typing:
summarize Wage, detail
For short, you can type “sum” in the place of summarize and “d” in the place of detail. This command should produce the following output:
The output confirms some of the details we saw using codebook: the smallest value for Wage is $0, while the largest is $110,000; the mean wage is $40,209.03, with a standard deviation of $27,211.23. Unlike codebook, summarize reports directly that there are 5,702 nonmissing observations for the variable Wage, and the median value, represented by the 50th percentile, is $36,000. We can also see that the data are skewed. The value labeled “skewness” tells us the degree and direction of the skew; the positive value here indicates skew to the right. The mean exceeding the median points the same way. Kurtosis is a measure of how heavy the tails of the distribution are: a normal distribution has a kurtosis of 3, and values greater than 3 indicate heavier tails than normal. See UCLA IDRE for a detailed explanation.
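If you want these numbers without reading them off the table, summarize leaves them behind as stored results. The following is a small sketch, assuming the data from this tutorial are loaded; it simply prints the mean, median, skewness, and kurtosis of Wage:

```stata
* Re-run the detailed summary quietly, then display the stored results.
quietly summarize Wage, detail
display "mean     = " r(mean)
display "median   = " r(p50)
display "skewness = " r(skewness)
display "kurtosis = " r(kurtosis)
```

A mean above the median together with a positive skewness value is the pattern described above.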
You can view a visual representation of the variable by creating a histogram:
histogram Wage, normal
The “normal” option overlays a normal density curve with the same mean and standard deviation as the data. You can see that Wage is not normally distributed: the bars spike above the curve and fall far below it throughout the range of wage values (x-axis). There is also a sudden spike in the right tail at $110,000, the maximum reported by codebook and summarize, where I top coded the wages.
It is generally expected that you run descriptive statistics on all of your variables. More on this in a second. First, you should check your variables for missing observations. There are several ways to do this. One way is the command misstable summarize. Note that you can specify which variables to look at (e.g., misstable summarize variable1 variable2…). By default, the command will look at all variables in the data. In the case of these data, that is okay because I’ve shortened the dataset to contain fewer than 20 variables. The result should be this:
You should see the variables that have missing observations, and the number of observations that are missing. For example, the variable mar_stat has 161 missing observations, whereas the variable biokids has 2,331. The table also reveals the number of unique values associated with each variable, along with the min and max values. The min and max can also be seen in the summarize output, or you can use the command “tabulate”, which will produce frequency tables.
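As a sketch of the commands just described (assuming the tutorial data are loaded and the variable names are the ones mentioned above):

```stata
* All variables in the dataset (the default)
misstable summarize

* Only the two variables discussed above
misstable summarize mar_stat biokids

* Frequency table that also shows a row for missing values
tabulate mar_stat, missing
```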
Another way to do this is with the command “mdesc”, which is a user-submitted command that tells Stata to generate a table that shows the missing values:
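Because mdesc is not part of official Stata, it has to be installed once before it will run. A minimal sketch, assuming you install it from SSC in the same way as estout later in this post:

```stata
* One-time installation (mdesc is a user-written command)
ssc install mdesc

* Missing counts and percentages for every variable
mdesc

* Or only for selected variables
mdesc mar_stat biokids
```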
The screen capture doesn’t show the results for all of the variables because it is cut off on my screen, but you get the idea. Again, by default, mdesc will run through all the variables in the dataset, unless you specify certain variables. Instead of showing you the number of unique values, and the lowest and highest values, mdesc shows you the percent that are missing, in addition to the total number of missing observations. I generally find this more helpful than the results produced by misstable summarize.
Next, I walk through different ways to address missing values. In the class that I made this tutorial for, replacing the missing values with the mean or mode was acceptable, as long as the student explicitly stated so. More advanced users will likely use methods such as multiple imputation (e.g., Stata13 Manual). You can also listwise delete any observations that contain missing values. I provide an example of a loop that will do this:
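A minimal sketch of such a loop, written to match the description that follows; the exact layout in the program may differ:

```stata
* Listwise deletion: drop any observation with a missing value on any variable.
foreach v of varlist * {
    drop if missing(`v')
}
```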
Most users find this confusing, so bear with me. The command says: “foreach” variable, represented by some letter or word (in this case “v”), of the following varlist (which can be abbreviated “var”), here “*”, which is shorthand for all variables in the data, drop the observation if that variable is missing. For example, the first variable in the example data is id. Stata will go through id and see if any of the values are missing. No values of id are missing, so Stata will go to the next variable (birth_month) and check for missing values, and so on. When Stata gets to mar_stat, Stata will see that 161 people did not report their marital status, so Stata will remove them from the data, which will reduce our sample to 6,941 observations (7,102-161). Then Stata will remove the observations that are missing the number of biological children (2,331 in the full data, or fewer if some of them were already dropped for marital status), and so forth. In the end, you should have 3,623 observations. You can check this by running mdesc again.
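A quick way to run that check, as a sketch that assumes the loop has already been run and mdesc is installed:

```stata
* Number of observations left in memory after the loop
count

* Every variable should now report zero missing values
mdesc
```

If count reports something other than the 3,623 mentioned above, your copy of the data probably differs from mine.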
Side note: Sometime in the future, I will provide a tutorial on loops because they are a huge time saver, and they are something that is not often taught in statistics courses (at least not in any of mine).
Continuing on with the tutorial, in the next section, I show the user how to generate formatted tables with a user-submitted command “estout”:
ssc install estout
You can add the option “replace” to update estout, if you’ve already installed the command. You can read more about the command by visiting this link. Now type the following:
estpost sum *
esttab using "C:\Users\Stella\Documents\blog\ols\OLSdescripts.rtf", replace ///
cell((mean(label(Mean/Perc.) fmt(%9.2f)) sd(par label(S.D.) fmt(%9.2f)))) label nonumber nomtitle
eststo clear
The command “estpost” posts the results of whatever command you tell Stata to run on the data, so that esttab can tabulate them. In this case, I requested a summary of all my variables, as indicated by the asterisk (*). It will produce the following output:
The first column shows you the number of observations; the second shows the sum of the weights (which, without weights, equals the number of nonmissing observations); the third shows the mean of each variable; the fourth shows the variance; the fifth shows the standard deviation; the sixth shows the minimum value; the seventh shows the maximum value; and the last shows the sum of the variable. For descriptive statistics, we are interested in the mean, the standard deviation, and the total number of observations.
We use this information in the next command, which tells Stata to create a table (esttab) using this file location (C:\Users\Stella\Documents\blog\ols\) and this file name (OLSdescripts.rtf). I specify .rtf (rich text format) because I want to preserve the format. I use the option replace because I want to replace any existing document with this name. If you do not want to do this, simply change the name of the document (for example, OLSdescript2.rtf).
Next, I tell Stata that I want the first cell (i.e., column) to display the mean, labeled “Mean/Perc.” The format %9.2f displays the numbers with two decimal places. The next cell/column displays the standard deviation, labeled “S.D.”, with the same format; the par option places it in parentheses. The label option tells Stata to use the label associated with each variable, nonumber suppresses the model number that esttab adds over each column by default, and nomtitle suppresses the model title. You should get the following output:
Note that not all of the variables are labeled. You will want to label them before you submit an assignment or use the table in a research paper.
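Labels are attached with the label variable command. The following is a sketch only: the variable names come from this tutorial, but the label text is my own illustrative wording, so substitute whatever is accurate for your data:

```stata
* Attach descriptive labels so esttab's label option has something to print.
* The label text below is illustrative, not taken from the dataset.
label variable Wage     "Wage"
label variable mar_stat "Marital status"
label variable biokids  "Number of biological children"
```

Run these before estpost so that the labels appear in the exported table.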
You can also use esttab to generate formatted regression tables:
I show you how to do this toward the end of the program, after I go through quick explanations of how to check basic OLS assumptions.
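In the meantime, here is a rough sketch of what that can look like. The model is a placeholder that uses two variables mentioned in this post, not the specification from the program, and the output file name OLSresults.rtf is made up for this example:

```stata
* Hypothetical model: the predictor is a placeholder, not the tutorial's specification.
eststo clear
eststo: regress Wage biokids

* Write coefficients with standard errors and R-squared to an RTF file.
esttab using "C:\Users\Stella\Documents\blog\ols\OLSresults.rtf", replace ///
    label se r2

eststo clear
```

Here eststo stores the regression results, and esttab writes every stored model to the table, so additional models can be added with further eststo lines before the esttab call.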
A final note about the formatted results. Although esttab is a quick and easy way to create formatted tables within Stata, the user-submitted command tabout will give you even more control over how your results are displayed, especially if you know how to use LaTeX. I spent a lot of time tinkering with LaTeX and I never mastered it. Maybe a project for the future.