DID Basics

Brantly Callaway and Pedro H.C. Sant’Anna

2019-06-20

Note: This is a work in progress…

This vignette discusses the basics of using Difference-in-Differences (DID) designs to identify and estimate the average effect of participating in a treatment, with a particular focus on tools from the did package.

A Running Example

Throughout the vignette, we use a subset of the data from Callaway and Sant’Anna (2019). The dataset contains county-level teen employment rates from 2003 to 2007 and can be loaded with

library(did)
data(mpdta)

mpdta is a balanced panel with 2500 observations. The first few rows look like this:

head(mpdta)
#>     year countyreal     lpop     lemp first.treat treat
#> 866 2003       8001 5.896761 8.461469        2007     1
#> 841 2004       8001 5.896761 8.336870        2007     1
#> 842 2005       8001 5.896761 8.340217        2007     1
#> 819 2006       8001 5.896761 8.378161        2007     1
#> 827 2007       8001 5.896761 8.487352        2007     1
#> 937 2003       8019 2.232377 4.997212        2007     1

In particular applications, the dataset should look like this with the key parts being:

Here are some additional comments about the data structure:
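
One structural property stated above, that the panel is balanced, can be checked directly. The sketch below is only an illustration: `is_balanced_panel` is a hypothetical helper written for this vignette (it is not a function from the did package), and `toy` is a made-up data frame that borrows the column names `countyreal`, `year` and `lemp` from mpdta.

```r
# Hypothetical helper (not part of the did package): returns TRUE when
# every unit is observed exactly once in every time period.
is_balanced_panel <- function(data, idname, tname) {
  # Cross-tabulate units against time periods; a balanced panel has a
  # count of exactly 1 in every cell.
  counts <- table(data[[idname]], data[[tname]])
  all(counts == 1)
}

# Toy long-format panel with mpdta-style column names (values are made up).
toy <- data.frame(
  countyreal = rep(c(1, 2, 3), each = 2),
  year       = rep(c(2003, 2004), times = 3),
  lemp       = c(8.4, 8.3, 5.0, 5.1, 6.2, 6.1)
)

# Full panel: three counties, each observed in both years.
is_balanced_panel(toy, "countyreal", "year")

# Dropping one row leaves county 1 unobserved in 2003.
is_balanced_panel(toy[-1, ], "countyreal", "year")
```

With the data loaded as above, the same check could be run as `is_balanced_panel(mpdta, "countyreal", "year")`.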

Identification

First, we provide a brief overview of how identification works as well as parameters of interest in DID designs.

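The main identifying assumption in DID designs is called a parallel trends assumption. Let \(Y_{it}(0)\) denote individual \(i\)’s untreated “potential” outcome in time period \(t\) and \(Y_{it}(1)\) denote individual \(i\)’s treated “potential” outcome in time period \(t\). Let \(D_i\) equal 1 if individual \(i\) participates in the treatment and 0 otherwise. The observed outcome for an individual is \(Y_{it} = D_i Y_{it}(1) + (1-D_i)Y_{it}(0)\). Writing \(\Delta Y_t(0) = Y_t(0) - Y_{t-1}(0)\) for the change in untreated potential outcomes and \(X\) for covariates, the parallel trends assumption says that, conditional on covariates, the average change in untreated potential outcomes is the same for the treated and untreated groups: \begin{align*} E[\Delta Y_t(0) | X, D=1] = E[\Delta Y_t(0) | X, D=0] \end{align*}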
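
To see why parallel trends is useful, it may help to sketch how it identifies the average treatment effect on the treated in the simplest case. The derivation below rests on assumptions not stated in the text above: there are only two periods, \(t-1\) and \(t\); nobody is treated in period \(t-1\) and there is no anticipation, so \(Y_{t-1} = Y_{t-1}(0)\) for everyone; and \(ATT(X)\) is notation introduced here for the average effect among treated individuals with covariates \(X\). \(\Delta Y_t = Y_t - Y_{t-1}\) denotes the observed change in outcomes.

```latex
\begin{align*}
ATT(X) &= E[Y_t(1) - Y_t(0) \mid X, D=1] \\
       &= E[Y_t(1) - Y_{t-1}(0) \mid X, D=1] - E[Y_t(0) - Y_{t-1}(0) \mid X, D=1] \\
       &= E[\Delta Y_t \mid X, D=1] - E[\Delta Y_t(0) \mid X, D=0] \\
       &= E[\Delta Y_t \mid X, D=1] - E[\Delta Y_t \mid X, D=0]
\end{align*}
```

The second line adds and subtracts \(Y_{t-1}(0)\). The third line uses the fact that treated individuals' observed outcomes are \(Y_t(1)\) and \(Y_{t-1}(0)\), and then applies parallel trends to the second term. The last line uses the fact that untreated individuals' observed outcomes are their untreated potential outcomes in both periods. Under these assumptions, every term in the final expression depends only on observed data.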

Estimation

Two-Groups / Two Periods

Multiple Groups and Periods

Common Issues using the did package

References