blocking
1. example
- you want to measure the effect of fertilizer on growth. You have four fields, each at a different elevation. If you randomly select which fields to fertilize, your estimate of the effect will include variance that is due to elevation. So instead you should apply both treatments (fertilizer and no fertilizer) at each elevation and estimate the effect on growth within each elevation
2. basic idea
- you have a factor and you want to measure the response to that factor
- but there could be other factors confounding that response (see confounds)
- so, if you can identify the confound, you can group your experiments into blocks within which the variance should, hopefully, only be due to your primary factor
- what if there are other confounds within the block ("lurking" confounds)?
- whenever we can't do blocking, we just need to randomize and hope for the best
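To make the basic idea concrete, here is a minimal simulation sketch of the fertilizer example. Everything numeric is an assumption made up for illustration: the true effect, the elevation effects, the noise level, and the choice of two plots per field. It compares a completely randomized assignment against a blocked one and looks at how much the effect estimate varies across repeated experiments.

```python
import random
import statistics

# All numbers below are hypothetical, chosen only to illustrate the idea.
TRUE_EFFECT = 2.0                         # assumed fertilizer effect on growth
ELEVATION_EFFECT = [0.0, 3.0, 6.0, 9.0]   # assumed nuisance effect, one per field
NOISE_SD = 1.0                            # assumed plot-to-plot noise


def growth(field, fertilized, rng):
    """Simulated growth of one plot in the given field."""
    return (10.0 + ELEVATION_EFFECT[field]
            + TRUE_EFFECT * fertilized
            + rng.gauss(0.0, NOISE_SD))


def completely_randomized(rng):
    """Pick 4 of the 8 plots at random to fertilize, ignoring fields."""
    plots = [f for f in range(4) for _ in range(2)]  # field index of each plot
    treated = set(rng.sample(range(8), 4))
    fert = [growth(f, 1, rng) for k, f in enumerate(plots) if k in treated]
    ctrl = [growth(f, 0, rng) for k, f in enumerate(plots) if k not in treated]
    return statistics.mean(fert) - statistics.mean(ctrl)


def blocked(rng):
    """Fertilize exactly one of the two plots in every field (block).

    The two plots in a field are exchangeable in this simulation, so which
    one gets the fertilizer does not need to be drawn explicitly.
    """
    diffs = [growth(f, 1, rng) - growth(f, 0, rng) for f in range(4)]
    return statistics.mean(diffs)


def simulate(design, n_reps=2000, seed=0):
    """Return (mean, standard deviation) of the estimate over many repetitions."""
    rng = random.Random(seed)
    estimates = [design(rng) for _ in range(n_reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    print("completely randomized:", simulate(completely_randomized))
    print("blocked:              ", simulate(blocked))
```

Under these assumptions both designs should be centered near the true effect, but the blocked estimates should be noticeably less spread out, because the elevation differences never enter a within-field comparison.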
3. questions
- I get the basic idea of blocking, but I'm not sure what the actual result of this analysis is supposed to be. Am I supposed to report the estimate per block? Or am I supposed to treat each block estimate as an experiment and then estimate using the block estimates?
- If it's the former: seems like the report would get complicated as the number of things you block on grows
- If it's the latter:
- Then aren't you re-including the variance due to the nuisance factor that you tried to block for?
- Response to the above: No, because the experiments are evenly divided among the nuisance factor's categories, so the nuisance factor contributes equally to every treatment and cancels in the comparison. Aggregating over the block estimates gives you an estimate of the effect on the population as a whole.
- But, if the nuisance variable has a strong effect, then you're probably going to want to look at the non-aggregated results anyway at some point.
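A small sketch of the two reporting options discussed above, using made-up growth numbers (the block names and values are hypothetical, not from any source). It computes the effect per block, then averages the block estimates, and compares the spread of the within-block differences to the spread of the raw measurements.

```python
from statistics import mean, stdev

# Hypothetical growth measurements: (unfertilized, fertilized) per elevation block.
growth_by_block = {
    "low": (10.0, 12.5),
    "mid": (13.0, 15.0),
    "high": (16.0, 17.5),
    "peak": (19.0, 21.0),
}

# Option 1: report the estimated effect separately for each block.
effect_by_block = {
    block: fert - ctrl for block, (ctrl, fert) in growth_by_block.items()
}

# Option 2: treat each block estimate as one observation and average them.
overall_effect = mean(list(effect_by_block.values()))

# Spread of the within-block differences vs. spread of the raw control values.
# The elevation effect shows up in the raw values but cancels in the differences.
spread_of_differences = stdev(list(effect_by_block.values()))
spread_of_raw_controls = stdev([ctrl for ctrl, _ in growth_by_block.values()])

if __name__ == "__main__":
    print(effect_by_block)
    print(overall_effect)
    print(spread_of_differences, spread_of_raw_controls)
```

With these numbers the averaged effect is 2.0, and the differences vary far less than the raw growth values do. That suggests aggregating block estimates does not bring the elevation variance back in, while the per-block dictionary is still there if the nuisance factor turns out to matter.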
4. statistical model
On the question I asked above, Wikipedia gives the following statistical model for one primary factor and one nuisance factor: \[ Y_{ij} = \mu + T_i + B_j + \epsilon \] where
- \(\epsilon\) is random error
- \(Y_{ij}\) is some observation with primary factor \(i\) and nuisance factor \(j\)
- \(\mu\) is the overall average
- \(T_i\) is the effect of \(i\)
- \(B_j\) is the effect of the nuisance \(j\)
Then, our estimations will be:
- estimate of \(\mu\): \(\bar{Y}\) – average of all observations
- estimate of \(T_i\): \(\bar{Y}_{i.} - \bar{Y}\) where \(\bar{Y}_{i.}\) is the average of \(Y_{ij}\) over all \(j\)
- this tells us how far the average response under primary factor setting \(i\) is from the overall average
- Question: will the estimated \(T_i\) be a good estimate if the \(Y_{ij}\) are not evenly distributed across all \(j\)? I'm pretty sure not; otherwise, we would be misled by confounds. Ah, I see: Wikipedia says that this is the statistical model for a randomized block design. I think this means you assume that, for each primary factor setting \(i\), the observations are evenly distributed across all \(j\).
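To check the question above, here is a minimal sketch that applies the estimator \(\bar{Y}_{i.} - \bar{Y}\) to noise-free data generated from the additive model. The values of \(\mu\), \(T_i\), and \(B_j\) are assumptions picked for illustration, and the helper names are hypothetical. One data set has every treatment in every block; the other deliberately does not.

```python
from statistics import mean

# Assumed model parameters (hypothetical): Y_ij = MU + T[i] + B[j], no noise.
MU = 10.0
T = [-1.0, 1.0]        # treatment effects, chosen to sum to zero
B = [-3.0, 0.0, 3.0]   # block (nuisance) effects, chosen to sum to zero


def observe(cells):
    """Return noise-free observations for the given (i, j) cells."""
    return {(i, j): MU + T[i] + B[j] for i, j in cells}


def estimate_treatment_effects(y):
    """Estimate each T_i as (mean of treatment i's observations) - (grand mean)."""
    grand_mean = mean(list(y.values()))
    return [
        mean([v for (i, _), v in y.items() if i == t]) - grand_mean
        for t in range(len(T))
    ]


# Balanced: every treatment appears once in every block.
balanced = observe([(i, j) for i in range(2) for j in range(3)])

# Unbalanced: treatment 0 only in the two lower blocks, treatment 1 only in
# the two upper blocks, so treatment and block are partly confounded.
unbalanced = observe([(0, 0), (0, 1), (1, 1), (1, 2)])

if __name__ == "__main__":
    print(estimate_treatment_effects(balanced))    # [-1.0, 1.0]
    print(estimate_treatment_effects(unbalanced))  # [-2.5, 2.5]
```

In the balanced case the block effects average out inside each \(\bar{Y}_{i.}\) and the estimator recovers the assumed \(T_i\) exactly. In the unbalanced case part of the block effect leaks into the treatment estimate, which is consistent with the suspicion that the simple averages only work when each treatment is spread evenly over the blocks.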
5. sources
- blocking wikipedia page
- very simple example. not a lot of technical depth
- penn state stats
- wikipedia page on pseudoreplication
- wikipedia page on paired difference test
- references the formula \(\operatorname{var}(Y_2 - Y_1) = \operatorname{var}(Y_2) + \operatorname{var}(Y_1) - 2\operatorname{cov}(Y_1, Y_2)\), which is also referenced in the wikipedia page for blocking
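As a quick numerical check of the variance-of-a-difference formula, the sketch below evaluates both sides on a small set of made-up paired measurements (the numbers are hypothetical). The `covariance` helper is written out by hand rather than taken from a library.

```python
from statistics import mean, variance


def covariance(xs, ys):
    """Sample covariance with the same n - 1 denominator as statistics.variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical paired measurements: one (Y_1, Y_2) pair per block.
y1 = [10.0, 13.0, 16.0, 19.0]   # e.g. unfertilized growth
y2 = [12.5, 15.0, 17.5, 21.0]   # e.g. fertilized growth in the same block

diffs = [b - a for a, b in zip(y1, y2)]

lhs = variance(diffs)                                        # var(Y_2 - Y_1)
rhs = variance(y2) + variance(y1) - 2 * covariance(y1, y2)   # right-hand side

if __name__ == "__main__":
    print(lhs, rhs)
    print(variance(y1) + variance(y2))  # what you would get with zero covariance
```

The two sides agree up to floating-point rounding. Because the pairs share a block, the covariance is large and positive, so the variance of the difference is much smaller than the sum of the two variances. That is the reason pairing (or blocking) helps.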