Monthly Archives: June 2017

More College Scorecard code for Stata

This post provides a new way to import and manage College Scorecard data in Stata.

Alan Riley, Stata’s VP for software development, created the following code that creates the same panel as described in my previous post.

But this one gets the job done in about 15 minutes!

Alan reached out after reading my previous post and seeing some of our Twitter conversation. He didn’t like hearing how long it took me (and others) to run this. So, Alan took it as a challenge to figure out a better way.

And that he did!

I asked if I could share this code here on my blog and he eagerly agreed. He offered one comment for users:

Just one caveat:

I feel that I just made some quick tweaks to the code, and there are probably not only more optimizations that could be found, but it could also be made more robust to potential problems with more use of Stata’s -confirm- and -assert- commands in various places to ensure everything is as expected.

Nick Cox (yes, the same Nick Cox from the Stata Forum) also reached out with some suggestions on the -destring- command I was previously using. I was destringing a lot of unnecessary items, which Alan fixed and greatly speeds up the process. Kevin Fosnacht offered all the labeling code in the previous post. And Daniel Klasik noted some quirks between Mac and PC users, where the “unitid” variable may create an error for Mac users. Thanks, guys!

And here is a .do file where I copied and pasted Alan’s original code (updated 11/14/2017):

Working with College Scorecard data

This post provides Stata commands to create a panel dataset out of the College Scorecard data. The first steps take only a few minutes, then the final destring step takes quite a while to run (at least an hour).

Step 1: Download the .csv files from here

In this example, I download these files into a “Scorecard” folder located on my desktop.

Notice how the file names have underscores in them. The first file is titled “MERGED1996_97_PP” and so on. Let’s drop the underscore and number in between (“_97_”) so the file now is titled “MERGED1996PP” and do this for each file:

Step 2: Convert .csv files to .dta files

Our raw data is now downloaded and prepped. Open a Stata .do file and import these .csv files. The following loop is a nice way to do this efficiently:

  • Row 2 tells Stata where to find the raw .csv data.
  • Row 5 tells Stata to remember (as “i”) any number that fall between 1996 and 2014.
  • Row 6 tells Stata to import any .csv file in my directory named “MERGED`i’PP” — here the `i’ is replaced by the numbers 1996 through 2014. This is why we got rid of the underscores in Step 1, it helps this loop operate efficiently.
  • Row 7 generates a new variable called “year” and sets its value equal to the corresponding year in the file.
  • Row 8 saves each .csv file as a .dta file and keeps the same file naming convention as before.
  • Row 9 tells Stata to start anew after each year so we end up with the following files in our directory:

Notice the .dta files are all now here and named in the exact same way as the original .csv files (well, technically I added an underscore after “MERGED”). This loop takes a few minutes to run.

Step 3: Append each year into a single file

Now that we have individual files, we want to stack them all on top of each other. We have to start somewhere, so let’s open the 1996 .dta file and then append all future years, 1997 through 2014, onto this file:

We will do this in a small loop this time, where:

  • Row 13 tells Stata to open the 1996 .dta file
  • Row 14 tells Stata to remember as “i” the numbers 1997 to 2014
  • Row 15 tells Stata to append onto the 1996 file each .dta that corresponds with the year `i’. The “force” command is needed here only because some years a variable is coded as a string and others numeric, so this just tells Stata to bypass that quirk.

Step 4: Destring the data

The data is now in a panel format, where each row contains a unique institution and year. Let’s take a quick look at UW-Madison (unitid=240444) by using the following command:

br ïunitid year instnm  debt_mdn if ïunitid==240444

We’re almost done, but notice a couple of quirks here. First, the median debt value is red, meaning it is string (because 1996 is “NULL”). Second, the variable ïunitid should be renamed to just plain old “unitid” – otherwise, things look pretty good here in terms of setting up a panel.

The following loop will clean this all up for us. It took me at least an hour, be warned. (Thank you Nick Cox for your helpful de-bugging!)

  • Row 19 renames ïunitid to unitid
  • Row 20 tells Stata to find string variables
  • Row 21 tells Stata to remember as “x” all variables in the file
  • Rows 22 and 23 replace all those NULLs and PrivacySuppressed cells with “.n” and “.p” missing values, respectively.
  • Row 25 destrings and replaces all variables

So now when we look at UW-Madison, we see the median debt values are black, meaning they are destringed (destrung?) with the “NULL” value for 1996 missing.

Step 5: Add variable and value labels

Thanks to Kevin Fosnacht (Indiana University), you can also add labels with the commands found below in the .do file. Thanks, Kevin!

Step 6: Save the new file so you don’t have to run all this again.

You now have all Scorecard data from 1996 to 2014 in a single panel dataset, allowing you to merge with other data sources and analyze this with panel data techniques.

There are several other ways one could go about managing this dataset and creating the panel, so forgive me for any omissions or oversights here. My goal is to simply help remove a data access barrier to facilitate greater research use of this public resource. Grad students developing their research programs might find this a useful data source for contributing to and extending higher education policy scholarship.

.do file: scorecard

Update 6/12/2017: the “.n” and “.p” command had a small bug on the destring loop. But thanks to Nick Cox, this is now fixed! Also, Kevin Fosnacht was so kind as to share his relabeling code, many thanks!

October to June high school FAFSA filing

We are 36 weeks into the 2017-18 FAFSA cycle and 2.04 million high school seniors have completed the form as of June 2, 2017. This is an increase of about 200,000 over last cycle, or about a 10% overall increase:

In this chart, a few important spikes stand out. The first is the holiday bumps we see in November and December, where filing slows down in a couple of weeks and the rebounds after the new year. The second is the March bump, as several states (notably, CA) have filing deadlines. The third is an early April bump that is simply an artifact of the way the Office of Federal Student aid began defining a high school senior. Up to that week, seniors included students no older than 18. Now it includes students no older than 19. This is a good change in the long-run, but ideally we would have 18-year-olds reported for the entire cycle.

With these details in mind, below is a chart showing the same trend but for each individual state (to June 2, 2017):

To put this another way, here’s a map showing the percent change in the two cycles, where Utah grew the most (39% increase) and Rhode Island the least (6% decrease).


Raw data (.xlsx): State Weeks – Oct to June 2