simpson's paradox

1. statement

Let \(S\) be an event and let \(X\) and \(Y\) be discrete random variables, all defined on a common probability space. Then there are cases where \[ \mathbb{P}[S \mid X=0, Y=y] > \mathbb{P}[S \mid X=1, Y=y] \] for every value \(y\) that \(Y\) can take, but we do not have \[ \mathbb{P}[S \mid X=0] > \mathbb{P}[S \mid X=1] \] (this isn't really a paradox, it's just something that might be surprising).

2. in words

Let \(S\) be the event in which a desired outcome occurs (recovery, election win, baseball hit). Then \((X=0)\) (a certain treatment, or candidate, or baseball player) may outperform \((X=1)\) (some other treatment, candidate, or player) for every sub-group \(y\), but still lose to \(X=1\) when all groups are considered together.

3. intuition

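Let's expand \(\mathbb{P}[S\mid X=0]\) using the law of total probability. \[ \mathbb{P}[S \mid X=0] = \sum_{y} \mathbb{P}[S\mid X=0, Y=y]\mathbb{P}[Y=y\mid X=0] \] We see that each term \(\mathbb{P}[S\mid X=0, Y=y]\) is weighted by \(\mathbb{P}[Y=y\mid X=0]\), and that each summand equals \(\mathbb{P}[S, Y=y \mid X=0]\). We can do the same expansion for \(\mathbb{P}[S\mid X=1]\), with weights \(\mathbb{P}[Y=y\mid X=1]\). The two sets of weights need not be the same. So it could be the case that the conditional distributions of \(Y\) given \(X=0\) and given \(X=1\) put their mass on different values of \(y\), in such a way that \(\mathbb{P}[S, Y=y \mid X=1] > \mathbb{P}[S, Y=y \mid X=0]\) for the values of \(y\) that dominate the sum. Then the total for \(X=1\) can be at least as large as the total for \(X=0\), even though \(X=0\) has the larger conditional probability for every single \(y\).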
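To make the weighting concrete, here is a minimal sketch of the expansion above in Python. The rates and weights are hypothetical numbers picked only for illustration (they do not come from any data set), and `marginal_rate` is a helper name introduced here.

```python
from fractions import Fraction

def marginal_rate(rates, weights):
    """Return sum_y P[S | X=x, Y=y] * P[Y=y | X=x], i.e. P[S | X=x]."""
    assert sum(weights) == 1, "weights must form a probability distribution"
    return sum(r * w for r, w in zip(rates, weights))

# Hypothetical conditional probabilities P[S | X=x, Y=y] for y = 0, 1.
rates_x0 = [Fraction(9, 10), Fraction(3, 10)]
rates_x1 = [Fraction(8, 10), Fraction(2, 10)]

# Hypothetical weights P[Y=y | X=x]: X=0 mostly sees the "hard" group y=1,
# while X=1 mostly sees the "easy" group y=0.
weights_x0 = [Fraction(1, 10), Fraction(9, 10)]
weights_x1 = [Fraction(9, 10), Fraction(1, 10)]

p_s_x0 = marginal_rate(rates_x0, weights_x0)
p_s_x1 = marginal_rate(rates_x1, weights_x1)

# X=0 is better within every group...
print(all(r0 > r1 for r0, r1 in zip(rates_x0, rates_x1)))
# ...but worse once each group is weighted by how often it occurs.
print(p_s_x0, p_s_x1, p_s_x0 < p_s_x1)
```

With these made-up numbers, \(X=0\) wins in both groups (0.9 vs 0.8 and 0.3 vs 0.2), yet the weighted totals are 9/25 for \(X=0\) and 37/50 for \(X=1\), because the two sums use different weights.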

You can think of an analogy to elections. In round \(y\), candidate 1 won district "1y" and took \(\mathbb{P}[S\mid X=1, Y=y]\) fraction of the votes there, and candidate 0 won district "0y" and took \(\mathbb{P}[S\mid X=0, Y=y]\) fraction of the votes there. It could be that the fraction of candidate 1 supporters in "1y" is higher than the fraction of candidate 0 supporters in "0y". But we won't know who had more votes until we weight by the populations of the respective districts (\(\mathbb{P}[Y = y \mid X=1]\) and \(\mathbb{P}[Y=y \mid X=0]\)).

4. concrete example

Let \(Y\) and \(X\) take on values in \(\{0,1\}\). Let \(X\) index a baseball player. Let \(Y\) index a game. Let \(S\) be the event of a hit. Say that in game 0:

player 1 is up to bat \(50,000\) times and hits \(\frac{24,000}{50,000}\)
player 0 is up \(2\) times and hits \(\frac{1}{2}\)

Then, we have \(\mathbb{P}[S \mid X=0, Y=0] = \frac{1}{2} > \frac{24,000}{50,000} = \mathbb{P}[S\mid X=1, Y=0]\). Then, in game 1:

player 1 is up 10 times and hits \(\frac{1}{10}\)
player 0 is up 10,000 times and hits \(\frac{1,001}{10,000}\)

Then, we have \(\mathbb{P}[S \mid X=0, Y=1] = \frac{1,001}{10,000} > \frac{1}{10} = \mathbb{P}[S\mid X=1, Y=1]\).

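But if we look at each player's total batting average across both games, we can see at a glance that player 1 should be around \(\frac{1}{2}\) and player 0 should be around \(\frac{1}{10}\), because each total is dominated by the game in which that player batted the most. So player 1 has the higher overall average, even though player 0 had the higher average in each individual game.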
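The totals can be checked with exact arithmetic. The sketch below is only an illustration: it assumes player 1 went 24,000 for 50,000 in game 0 (a value just under \(\frac{1}{2}\), so that player 0's \(\frac{1}{2}\) is the higher average in that game), and the names `counts`, `game_average`, and `overall_average` are introduced here.

```python
from fractions import Fraction

# counts[player][game] = (hits, at_bats).
# Assumption: player 1's game-0 line is 24,000 hits in 50,000 at-bats.
counts = {
    0: {0: (1, 2), 1: (1001, 10000)},
    1: {0: (24000, 50000), 1: (1, 10)},
}

def game_average(player, game):
    """P[S | X=player, Y=game] as an exact fraction."""
    hits, at_bats = counts[player][game]
    return Fraction(hits, at_bats)

def overall_average(player):
    """P[S | X=player]: total hits over total at-bats across all games."""
    total_hits = sum(hits for hits, _ in counts[player].values())
    total_at_bats = sum(at_bats for _, at_bats in counts[player].values())
    return Fraction(total_hits, total_at_bats)

for game in (0, 1):
    # Player 0 has the higher average in each game taken separately.
    print(game, game_average(0, game) > game_average(1, game))

# Pooled over both games, the order flips.
print(float(overall_average(0)), float(overall_average(1)))
```

Under that assumption, player 0 comes out at 1,002 hits in 10,002 at-bats (roughly 0.10) and player 1 at 24,001 hits in 50,010 at-bats (roughly 0.48). Pooling the counts is the same as weighting each game's average by that player's share of at-bats in that game, which is the \(\mathbb{P}[Y=y\mid X=x]\) weighting from the intuition section.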

5. relevant links

6.436 lecture notes
Blog Post from Ben Recht – He uses a toy example that I'm pretty sure is just Simpson's paradox