Category Archives: Research

Using difference-in-differences in higher education research

Difference-in-differences is gaining popularity in higher education policy research and for good reason. Under certain conditions, it can help us evaluate the effectiveness of policy changes.

The basic idea is that two groups were following similar trend lines for a period of time. But eventually, one group gets exposed to a new policy (the treatment) while the other group does not (the comparison). This change essentially splits time in two, where the treatment group’s exposure to the policy puts them on a new trend line. Had the policy never been adopted, the two groups would have continued on similar paths.

I made a data file and show the steps I took to conduct this analysis in Stata.

Step 1: Generate treatment variable

Let’s say College A is exposed to the policy change (treatment) and College B is not (comparison). We just need a simple dummy variable to categorize these two groups. Let’s call it “treat,” where College A gets a value of 1 and College B gets a value of 0.

Step 2: Generate “post” policy variable

In the previous step, we didn’t say when the policy change occurred, so we need to do that now. Let’s say it began in 2015, meaning all years from that point forward are “post-policy” while those prior are “pre-policy.”

Step 3: Examine trends for the two groups

Now that we’ve identified the treatment/control and the pre/post periods, we can put it all together in a simple graph. I like to use the user-written “lgraph” command (use “ssc install lgraph, replace” to get it).

We see here the two groups were following similar trends prior to the 2015 policy change, and then the treatment group started to increase at a higher rate while the comparison group did not.

Step 4: Difference-in-differences means table

The visual inspection looks like there’s probably a policy effect, but it’s hard to tell the magnitude. To get at that, we need to measure the difference in groups means before and after the policy:

Below is a simple table calculating the difference between the two groups before (-10) and after (5) the policy. It then calculates the difference before and after within each group (20 and 35, respectively).

Pre Post Difference
Comparison 517.5 537.5 20
Treatment 507.5 542.5 35
Difference -10 5 15

The comparison group increased by 20 units after the policy, while the treatment increased by 35. Similarly, the treatment group was 10 units lower than the comparison group prior to the policy, but was ahead by 5 units after. When we calculate the difference in the group differences, we get 15 (e.g., 35 minus 20, or 5 minus -10).

Step 5: Difference-in-differences regression

We can run a regression on the data using the two variables created in Steps 1 and 2. The only trick is we need to interact those two variables (treat x post) to get our difference-in-differences estimate.

Here, the “treat” dummy measures the treatment group’s pre-policy difference from the comparison group. And “post” is the comparison groups post-policy change. The interaction between the two variables (“treat x post”) is our average treatment effect of 15, the same number we saw in the previous step. The intercept is the comparison group’s pre-policy mean.

In a regression framework, we can easily add covariates, use multiple comparison groups for robustness checks, and address issues that may arise with respect to standard errors. Doing so can help rule out plausible alternative explanations to the findings, assuming other important considerations are also met.

My goal with this post was to break down the difference-in-differences approach to help make it a little more accessible and less intimidating to researchers/policy analysts. I am still leaning a lot about the technique, so please consider these steps some illustrative tips to get oriented/introduced.

Stata replication code:

// generate treatment variable (College A is the treated unit, College B is comparison)
gen treat = 1 if id==1
recode treat (.=0)
lab def treat_lab 0 “comparison” 1 “treatment”
lab val treat treat_lab
tabstat y, by(treat) stat(n mean min max sd)

// generate post-treatment period (2015 is start date)
gen post = 1 if year>=2015
recode post (.=0)
lab def post_lab 0 “pre” 1 “post”
lab val post post_lab
tabstat y, by(post) stat(n mean min max sd)

// descriptives of the two groups
table year treat, c(mean y)
ssc install lgraph, replace
lgraph y year, by(treat) stat(mean) xline(2015) ylab(, nogrid) scheme(s2mono)

// did means table
table treat post, c(mean y)

// did regression
xtset id year
xtreg y i.treat i.post i.treat#i.post

State higher education funding – motion charts

Each year, State Higher Education Executive Officers (SHEEO) produces the go-to guide for state-level trends in higher education finance, downloadable and interactive data here.

Using this data, I created the following motion chart where each bubble represents a state. Pressing the play button shows how appropriations (x-axis) and net tuition revenue (y-axis) change over time.

You can select individual states and change the bubble color/size. All financial data are presented in 2017 dollars based on CPI and the “tuition-to-appropriations” ratio is simply that: net tuition relative to net appropriations, where values over 1 indicate the state generates more revenue from tuition than it appropriates. For more details on the data, see their report.

I hope you find this tool helpful! I like using it as a teaching tool, so please feel free to use and share. I sometimes run into trouble depending on the browser I’m using, so just a heads-up on that front. Unfortunately, Google retired Motion Charts, but I was able to update this with the most recent data so hopefully it has some shelf life.

Please let me know if you see any bugs or if you find interesting patterns!

High school FAFSA filing update

In the first three months of the current (2018-19) FAFSA filing cycle, 1.275 million high school seniors completed the form. This is a 6% increase from last year, as shown in the chart below.

We monitored these weekly trends last year and found about 2.1 million high school seniors filed by the end of the school year. So within just the first three months of the new cycle, we are over half-way to last year’s filing levels.

These trends hold pretty steady across the states, where all but five (MT, OR, VT, WV, WY) are outpacing last year’s figures each week.

Several states are making really strong progress on high school FAFSA completions over last year’s cycle. Arizona, Kansas, Louisiana, Mississippi, and Washington, DC, stand out as having some of the highest growth rates over last year.

We will continue monitoring these trends, especially in early spring when some state aid programs have filing deadlines.

Below is an excel file behind these charts, listing the weekly number of completions for the two cycles, by each state.

State high school FAFSA completions (.xlsx): state_hs_fafsa_week_13

 

 

Enrollment trends in UW Colleges

The UW Colleges are a unique feature of the public higher education landscape. They are classified as “Associate’s: High-Transfer, High-Traditional” colleges, meaning they largely serve traditional-aged students (coming directly out of high school) with a goal of improving transfer from Associate’s to Bachelor’s degree programs.

Nationwide, there are about 160 of these colleges* and they enroll about 1.5 million undergraduates. Since the end of the Great Recession, enrollments have dropped and today’s enrollments are higher than the pre-Recession levels.^

Colleges located in Midwestern states are experiencing similar enrollment declines since 2010, and the following chart shows the four largest Midwestern Higher Education Compact states’ total enrollments for their “high-transfer, high-traditional” Associate’s degree granting colleges:

Using the UW System’s Accountability Dashboard, we can see how UW Colleges’ enrollment levels have changed relative to four-year universities in the system (this chart excludes UW-Madison and UW-Milwaukee). Approximately 12,000 students enrolled during the fall of 2016-17.**

UW Colleges enrolled approximately 7 percent of the system’s total undergraduates in 2016-17 and this share has hovered between 6 and 8 percent over time.

And a growing share of UW Colleges’ enrollments are from students who identify as Black, Hispanic, or Native American. Since 2010, the percentage of students classified into these three racial/ethnic groups has doubled within the UW Colleges.

As conversations about UW Colleges continue, it is also worth stating these two-year institutions serve different missions than the Wisconsin Technical College System (WTCS). To oversimplify the difference, technical colleges are focused more on vocational training while UW Colleges focuses more on transfer. Nevertheless, the final chart uses IPEDS enrollment data to summarize how enrollments in the technical colleges, university system, and UW Colleges have changed over time.

Notes:

* A “college” could include multiple campuses. For example, UW Colleges is reported as a single institution in IPEDS, but accounts for 13 campuses and an online presence.

^ In all charts, we would get slightly different enrollment trends when using 12-month headcounts from IPEDS, but I used fall enrollment (undergraduate total, degree-seeking and non-degree-seeking).

** Note the axis scale on the previous chart makes the recent enrollment drop look less steep and the prior (IPEDS) data only includes 2015 enrollment, not 2016.

Waffle chart in Excel

Below are two ways to display the same data.

The first is a trusty old pie chart.

The second is its cooler pastry cousin, the waffle chart.

These data are from IPEDS, just a quick count of public college undergraduates by their college’s selectivity.

We can see in both charts the majority enroll in open-access colleges. But the waffle is easier on the eye, I think.

Turns out, the choice between pies and waffles can be contentious.

If you’re like me, you probably want to use a waffle chart to display your data sometime.

You can make one of these in R using the “waffle” command…if you are familiar with that program. I, sheepishly, am not.

For those of us still in the old school, I made this waffle chart in Excel. Please feel free to use in your own work, but keep in mind this will only work for up to four slices of pie.

Download this Excel file: waffle template

Step one: Enter up to four category names in cells B3:B6

Step two: Enter percentages (high to low) in cells C3:C6. The template is just filling these with random numbers as a placeholder.

Step three: Choose which chart you like better, the square waffle or the rectangle waffle. Both charts link to the same data.

Note: You’ll see two other tabs in this file, “chart 1” and “chart 2,” which are the underlying formulas driving the chart. Just pretend they’re not there.

I created and modified this based on a helpful guide found here. If you end up using this file, please let me know if you detect any bugs. I think I’ve worked them out, but don’t hesitate to contact me if you see any or have suggestions.

Why we need comparison groups in PBF research

Tennessee began implementing performance-based funding in 2011 as part of its statewide effort to improve college completions. The figure below uses IPEDS completion data plotting the average number of bachelor’s degrees produced by each of the state’s 9 universities over time.

One could evaluate the policy by comparing outcomes in the pre-treatment years against those in the post-treatment years. Doing so would result in a statistically significant “impact,” where 320 more students graduated after the policy was in effect.

Pre Post Difference
Tennessee 1940 2260 320

This Interrupted Times Series approach was recently used in a report concluding that 380 more students (Table 24) graduated because of the policy. My IPEDS estimate and the one produced by the evaluation firm use different data, but are in the same ballpark.

Anyway, simply showing a slope changed after a certain point in time is not strong enough evidence to make causal claims. In very limited conditions can one make causal claims with this approach. But this is uncommon, making interrupted time series a pretty uncommon technique to see in policy research.

A better approach would add a comparison group (a la Comparative Interrupted Time Series or its extensions). If one were to do that, then they would compare trends in Tennessee to other universities in the U.S. The graph below does that just for illustrative purposes:

By adding a comparison group, we can see that the gains experienced in Tennessee were pretty much on par with trends occurring elsewhere in the U.S.:

Pre Post Difference
Tennessee 1940 2260 320
Other states 1612 1850 238
Difference 328 411 83

The difference-in-differences estimate, which is a more accurate measure of the policy’s impact, is 83. And if we run all this through a regression model, we can see if 83 is a significant difference between these groups.

It is not:

Using this more appropriate design would likely yield smaller impacts than those reported in the recent evaluation. And these small impacts likely wouldn’t be distinguishable from zero.

I wanted to share this brief back of the envelope illustration for two reasons. First, I am working on a project examining Tennessee’s PBF policy and the only “impact” we are seeing is in the community college sector (more certificates). We are not finding the same results in associate’s degree or bachelor’s degree production. Second, it gives me an opportunity to explain why research design matters in policy analysis. I don’t pretend to be a methodologist or economist; I am an applied education researcher trying my best to keep up with social science standards. Hopefully this quick post illustrates why that’s important.

June 30 FAFSA report

Below is a summary of high school FAFSA filing up to June 30 for the current and prior filing cycles.

June 30, 2016: 1,949,067
June 30, 2017: 2,128,524

This is a 9% boost in filing, or 179,457 more filers than last year!

You can download the raw data here or below.

We haven’t yet taken a close look at which high schools have shown the most growth, and I won’t pretend to know what these schools did to boost completions, but below is a quick look at the Top 20 in terms of largest raw number increase in FAFSAs. Way to go, Northside High School in Houston, TX, which saw the biggest jump in completions – going from 25 to 457!

State School Name City June 30 completions (16-17) June 30 completions (17-18) Change Percent Change
1 TX NORTHSIDE HIGH SCHOOL HOUSTON 25 457 432 1728%
2 PA PENN FOSTER HS SCRANTON 911 1213 302 33%
3 FL CYPRESS CREEK HIGH ORLANDO 295 499 204 69%
4 IL LINCOLN-WAY EAST HIGH SCHOOL FRANKFORT 327 529 202 62%
5 FL TIMBER CREEK HIGH ORLANDO 396 572 176 44%
6 TX ALLEN H S ALLEN 674 842 168 25%
7 NC ROLESVILLE HIGH ROLESVILLE 95 258 163 172%
8 FL WILLIAM R BOONE HIGH ORLANDO 304 467 163 54%
9 CA WARREN HIGH DOWNEY 556 710 154 28%
10 CA RANCHO VERDE HIGH MORENO VALLEY 465 614 149 32%
11 IL JONES COLLEGE PREP HIGH SCHOOL CHICAGO 226 374 148 65%
12 FL OLYMPIA HIGH ORLANDO 318 460 142 45%
13 FL FREEDOM HIGH ORLANDO 374 515 141 38%
14 NY NEW UTRECHT HIGH SCHOOL BROOKLYN 372 511 139 37%
15 NY BRENTWOOD HIGH SCHOOL BRENTWOOD 491 630 139 28%
16 TX THE WOODLANDS H S THE WOODLANDS 419 555 136 32%
17 PA PHILADELPHIA PERFORMING ARTS CS PHILADELPHIA 7 142 135 1929%
18 TX LOS FRESNOS H S LOS FRESNOS 341 476 135 40%
19 UT COPPER HILLS HIGH WEST JORDAN 256 390 134 52%
20 CA VALENCIA HIGH VALENCIA 318 449 131 41%

And in Wisconsin, here’s a list of the Top 20 schools in terms of raw growth in completions:

State School Name City June 30 completions (16-17) June 30 completions (17-18) Change Percent Change
1 WI KING INTERNATIONAL MILWAUKEE 201 278 77 38%
2 WI BADGER HIGH LAKE GENEVA 148 217 69 47%
3 WI OCONOMOWOC HIGH OCONOMOWOC 194 262 68 35%
4 WI SUN PRAIRIE HIGH SUN PRAIRIE 238 306 68 29%
5 WI CASE HIGH RACINE 160 227 67 42%
6 WI EAST HIGH APPLETON 170 235 65 38%
7 WI REAGAN COLLEGE PREPARATORY HIGH MILWAUKEE 206 269 63 31%
8 WI RIVERSIDE HIGH MILWAUKEE 183 243 60 33%
9 WI MIDDLETON HIGH MIDDLETON 239 296 57 24%
10 WI DE PERE HIGH DE PERE 180 235 55 31%
11 WI BAY PORT HIGH GREEN BAY 232 281 49 21%
12 WI FRANKLIN HIGH FRANKLIN 229 277 48 21%
13 WI NORTH HIGH WAUKESHA 119 166 47 39%
14 WI HAMILTON HIGH MILWAUKEE 128 174 46 36%
15 WI EAST HIGH MADISON 173 218 45 26%
16 WI CENTRAL HIGH SALEM 134 178 44 33%
17 WI CENTRAL HIGH LA CROSSE 128 171 43 34%
18 WI MUSKEGO HIGH MUSKEGO 220 263 43 20%
19 WI WAUNAKEE HIGH WAUNAKEE 154 193 39 25%
20 WI WEST HIGH WAUKESHA 149 188 39 26%

We will continue to analyze this data and plan to merge with other data sources to gain a better understanding of the variation that exists in filing rates.

We want to be sure to make this data available along the way, so please feel free to download and use the following high school and state-level data comparing the two cycles: FAFSA completions to June 30.xlsx

I wish I had time and resources to make this data more user-friendly and to share more widely. But until then, hopefully this good old fashioned Excel file is of use!

More College Scorecard code for Stata

This post provides a new way to import and manage College Scorecard data in Stata.

Alan Riley, Stata’s VP for software development, created the following code that creates the same panel as described in my previous post.

But this one gets the job done in about 15 minutes!

Alan reached out after reading my previous post and seeing some of our Twitter conversation. He didn’t like hearing how long it took me (and others) to run this. So, Alan took it as a challenge to figure out a better way.

And that he did!

I asked if I could share this code here on my blog and he eagerly agreed. He offered one comment for users:

Just one caveat:

I feel that I just made some quick tweaks to the code, and there are probably not only more optimizations that could be found, but it could also be made more robust to potential problems with more use of Stata’s -confirm- and -assert- commands in various places to ensure everything is as expected.

Nick Cox (yes, the same Nick Cox from the Stata Forum) also reached out with some suggestions on the -destring- command I was previously using. I was destringing a lot of unnecessary items, which Alan fixed and greatly speeds up the process. Kevin Fosnacht offered all the labeling code in the previous post. And Daniel Klasik noted some quirks between Mac and PC users, where the “unitid” variable may create an error for Mac users. Thanks, guys!

And here is a .do file where I copied and pasted Alan’s original code (updated 11/14/2017): https://uwmadison.box.com/s/wo6dtfp141x103cen3uq2zr8s09av610

Working with College Scorecard data

This post provides Stata commands to create a panel dataset out of the College Scorecard data. The first steps take only a few minutes, then the final destring step takes quite a while to run (at least an hour).

Step 1: Download the .csv files from here

In this example, I download these files into a “Scorecard” folder located on my desktop.

Notice how the file names have underscores in them. The first file is titled “MERGED1996_97_PP” and so on. Let’s drop the underscore and number in between (“_97_”) so the file now is titled “MERGED1996PP” and do this for each file:

Step 2: Convert .csv files to .dta files

Our raw data is now downloaded and prepped. Open a Stata .do file and import these .csv files. The following loop is a nice way to do this efficiently:

  • Row 2 tells Stata where to find the raw .csv data.
  • Row 5 tells Stata to remember (as “i”) any number that fall between 1996 and 2014.
  • Row 6 tells Stata to import any .csv file in my directory named “MERGED`i’PP” — here the `i’ is replaced by the numbers 1996 through 2014. This is why we got rid of the underscores in Step 1, it helps this loop operate efficiently.
  • Row 7 generates a new variable called “year” and sets its value equal to the corresponding year in the file.
  • Row 8 saves each .csv file as a .dta file and keeps the same file naming convention as before.
  • Row 9 tells Stata to start anew after each year so we end up with the following files in our directory:

Notice the .dta files are all now here and named in the exact same way as the original .csv files (well, technically I added an underscore after “MERGED”). This loop takes a few minutes to run.

Step 3: Append each year into a single file

Now that we have individual files, we want to stack them all on top of each other. We have to start somewhere, so let’s open the 1996 .dta file and then append all future years, 1997 through 2014, onto this file:

We will do this in a small loop this time, where:

  • Row 13 tells Stata to open the 1996 .dta file
  • Row 14 tells Stata to remember as “i” the numbers 1997 to 2014
  • Row 15 tells Stata to append onto the 1996 file each .dta that corresponds with the year `i’. The “force” command is needed here only because some years a variable is coded as a string and others numeric, so this just tells Stata to bypass that quirk.

Step 4: Destring the data

The data is now in a panel format, where each row contains a unique institution and year. Let’s take a quick look at UW-Madison (unitid=240444) by using the following command:

br ïunitid year instnm  debt_mdn if ïunitid==240444

We’re almost done, but notice a couple of quirks here. First, the median debt value is red, meaning it is string (because 1996 is “NULL”). Second, the variable ïunitid should be renamed to just plain old “unitid” – otherwise, things look pretty good here in terms of setting up a panel.

The following loop will clean this all up for us. It took me at least an hour, be warned. (Thank you Nick Cox for your helpful de-bugging!)

  • Row 19 renames ïunitid to unitid
  • Row 20 tells Stata to find string variables
  • Row 21 tells Stata to remember as “x” all variables in the file
  • Rows 22 and 23 replace all those NULLs and PrivacySuppressed cells with “.n” and “.p” missing values, respectively.
  • Row 25 destrings and replaces all variables

So now when we look at UW-Madison, we see the median debt values are black, meaning they are destringed (destrung?) with the “NULL” value for 1996 missing.

Step 5: Add variable and value labels

Thanks to Kevin Fosnacht (Indiana University), you can also add labels with the commands found below in the .do file. Thanks, Kevin!

Step 6: Save the new file so you don’t have to run all this again.

You now have all Scorecard data from 1996 to 2014 in a single panel dataset, allowing you to merge with other data sources and analyze this with panel data techniques.

There are several other ways one could go about managing this dataset and creating the panel, so forgive me for any omissions or oversights here. My goal is to simply help remove a data access barrier to facilitate greater research use of this public resource. Grad students developing their research programs might find this a useful data source for contributing to and extending higher education policy scholarship.

.do file: scorecard

Update 6/12/2017: the “.n” and “.p” command had a small bug on the destring loop. But thanks to Nick Cox, this is now fixed! Also, Kevin Fosnacht was so kind as to share his relabeling code, many thanks!

October to June high school FAFSA filing

We are 36 weeks into the 2017-18 FAFSA cycle and 2.04 million high school seniors have completed the form as of June 2, 2017. This is an increase of about 200,000 over last cycle, or about a 10% overall increase:

In this chart, a few important spikes stand out. The first is the holiday bumps we see in November and December, where filing slows down in a couple of weeks and the rebounds after the new year. The second is the March bump, as several states (notably, CA) have filing deadlines. The third is an early April bump that is simply an artifact of the way the Office of Federal Student aid began defining a high school senior. Up to that week, seniors included students no older than 18. Now it includes students no older than 19. This is a good change in the long-run, but ideally we would have 18-year-olds reported for the entire cycle.

With these details in mind, below is a chart showing the same trend but for each individual state (to June 2, 2017):

To put this another way, here’s a map showing the percent change in the two cycles, where Utah grew the most (39% increase) and Rhode Island the least (6% decrease).

 

Raw data (.xlsx): State Weeks – Oct to June 2