blocking

1. example

You want to measure the effect of fertilizer on growth. You have four fields, each at a different elevation. If you randomly select which fields to fertilize, your estimate of the effect will include variance that is due to elevation. So instead you should apply both fertilizer and no fertilizer at each elevation (e.g. on separate plots within each field) and estimate the effect on growth per elevation.
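
To get a feel for how much elevation can inflate the variance, here is a rough simulation sketch. Everything in it is an assumption made up for illustration: the elevation offsets, the true fertilizer effect, the noise level, and the idea that each field is split into two plots. It compares fertilizing a random half of all plots against fertilizing exactly one plot per field.

```python
import random
import statistics

# All numbers below are made up for illustration.
ELEVATION_EFFECT = [0.0, 5.0, 10.0, 15.0]  # growth offset of each of the four fields
FERTILIZER_EFFECT = 2.0                     # true effect we are trying to recover
NOISE_SD = 1.0                              # plot-to-plot noise
N_FIELDS = len(ELEVATION_EFFECT)
PLOTS_PER_FIELD = 2                         # assumption: each field is split in two


def growth(field, fertilized, rng):
    """Simulated growth of one plot in the given field."""
    effect = FERTILIZER_EFFECT if fertilized else 0.0
    return ELEVATION_EFFECT[field] + effect + rng.gauss(0.0, NOISE_SD)


def randomized_estimate(rng):
    """Fertilize a random half of all plots, ignoring which field they are in."""
    n_plots = N_FIELDS * PLOTS_PER_FIELD
    treated = set(rng.sample(range(n_plots), n_plots // 2))
    treated_growth, control_growth = [], []
    for plot in range(n_plots):
        field = plot // PLOTS_PER_FIELD
        y = growth(field, plot in treated, rng)
        (treated_growth if plot in treated else control_growth).append(y)
    return statistics.mean(treated_growth) - statistics.mean(control_growth)


def blocked_estimate(rng):
    """Fertilize exactly one plot per field; average the within-field differences."""
    diffs = [growth(f, True, rng) - growth(f, False, rng) for f in range(N_FIELDS)]
    return statistics.mean(diffs)


def simulate(n_trials=2000, seed=0):
    """Repeat both designs many times and collect the effect estimates."""
    rng = random.Random(seed)
    randomized = [randomized_estimate(rng) for _ in range(n_trials)]
    blocked = [blocked_estimate(rng) for _ in range(n_trials)]
    return randomized, blocked


randomized, blocked = simulate()
print("randomized: mean", statistics.mean(randomized), "sd", statistics.stdev(randomized))
print("blocked:    mean", statistics.mean(blocked), "sd", statistics.stdev(blocked))
```

Both designs use the same number of plots, so any difference in the spread of the estimates comes from how the treatment is assigned, not from sample size.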

2. basic idea

You have a factor and you want to measure the response to that factor.
But there could be other factors confounding that response (see confounds).
So, if you can identify the confound, you can group your experiments into blocks within which the variance should, hopefully, only be due to your primary factor.
What if there are other confounds within the block ("lurking" confounds)?
- Whenever we can't do blocking, we just need to randomize and hope for the best.
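
One common way to combine the two ideas is to block on the confound you know about and randomize within each block to deal with the ones you don't. Below is a minimal sketch of that assignment step. The function name `assign_within_blocks`, the unit labels, and the block names are all hypothetical.

```python
import random
from collections import Counter


def assign_within_blocks(units_by_block, treatments, seed=0):
    """Randomly assign treatments so each one appears equally often in every block."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in units_by_block.items():
        if len(units) % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be split evenly across treatments")
        # Equal number of copies of each treatment, shuffled within the block only.
        labels = list(treatments) * (len(units) // len(treatments))
        rng.shuffle(labels)
        assignment.update(zip(units, labels))
    return assignment


# Hypothetical experimental units, grouped by the known nuisance factor.
units_by_block = {
    "low": ["low-1", "low-2"],
    "mid": ["mid-1", "mid-2"],
    "high": ["high-1", "high-2"],
}
assignment = assign_within_blocks(units_by_block, ["fertilizer", "none"])
for block, units in units_by_block.items():
    print(block, Counter(assignment[u] for u in units))
```

The shuffle only decides which unit inside a block gets which treatment; the balance across blocks is fixed by construction.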

3. questions

I get the basic idea of blocking, but I'm not sure what the actual result of this analysis is supposed to be. Am I supposed to report the estimate per block? Or am I supposed to treat each block estimate as an experiment and then estimate using the block estimates?
If it's the former: seems like things would get complicated as the number of things you block on grows.
If it's the latter:
- Then aren't you re-including the variance due to the nuisance factor that you tried to block for?
- Response to the above: No, because the experiments are evenly divided among the nuisance factor's categories. Aggregating over the block estimates allows you to get an estimate of the effect on the population as a whole.
- But if the nuisance variable has a strong effect, then you're probably going to want to look at the non-aggregated results anyway at some point.
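
A tiny worked example of the two options, using made-up growth numbers (one control and one fertilized measurement per block). It computes the per-block effects and then the aggregate over blocks, which in this balanced case matches the plain difference of treatment means.

```python
from statistics import mean

# Made-up growth measurements: block -> (control, fertilized)
growth_by_block = {
    "low": (10.0, 12.0),
    "mid": (15.0, 18.0),
    "high": (20.0, 21.0),
}

# Option 1: report one effect estimate per block.
per_block_effect = {
    block: treated - control
    for block, (control, treated) in growth_by_block.items()
}

# Option 2: treat each block estimate as one data point and average them.
aggregate_effect = mean(list(per_block_effect.values()))

# For comparison: difference of overall treatment means, ignoring blocks.
overall_treated = mean([treated for _, treated in growth_by_block.values()])
overall_control = mean([control for control, _ in growth_by_block.values()])
naive_difference = overall_treated - overall_control

print(per_block_effect)
print(aggregate_effect, naive_difference)
```

The block-to-block differences in baseline growth (10 vs 15 vs 20) cancel inside each per-block difference, so they do not show up in the spread of the per-block effects.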

4. statistical model

On the question I asked above, Wikipedia gives the following statistical model for one primary factor and one nuisance factor: \[ Y_{ij} = \mu + T_i + B_j + \epsilon \] where

\(\epsilon\) is random error
\(Y_{ij}\) is some observation with primary factor \(i\) and nuisance factor \(j\)
\(\mu\) is the overall average
\(T_i\) is the effect of \(i\)
\(B_j\) is the effect of the nuisance \(j\)

Then, our estimates will be:

estimate of \(\mu\): \(\bar{Y}\) – average of all observations
estimate of \(T_i\): \(\bar{Y}_{i.} - \bar{Y}\) where \(\bar{Y}_{i.}\) is the average of \(Y_{ij}\) over all \(j\)
- we want to know how much the response, given \(i\), differs from the overall average
- Question: will the estimated \(T_i\) be a good estimate if the \(Y_{ij}\) are not evenly spread across all \(j\)? I'm pretty sure not; otherwise we would be misled by confounds, since \(\bar{Y}_{i.}\) would pick up more of some \(B_j\) than others. Ah, I see Wikipedia says that this is the statistical model for a randomized block design. I think this means that you assume the observations are evenly spread across all \(j\) for each primary factor setting \(i\) (i.e. each \(i\) appears equally often in every block).
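
To check the question above, here is a small noise-free sketch. The values of \(\mu\), \(T_i\) and \(B_j\) are made up, and `estimate_effects` is a hypothetical helper that just applies the two estimators listed above. It is run once on a balanced layout and once on a deliberately unbalanced one.

```python
from statistics import mean

# Made-up true parameters of the additive model Y_ij = mu + T_i + B_j (no noise).
MU = 10.0
T = [-1.0, 1.0]        # primary factor effects
B = [-5.0, 0.0, 5.0]   # nuisance (block) effects


def response(i, j):
    return MU + T[i] + B[j]


def estimate_effects(cells):
    """cells: list of (i, j) pairs observed. Returns (mu_hat, [T_hat_i, ...])."""
    observations = [(i, response(i, j)) for i, j in cells]
    mu_hat = mean([y for _, y in observations])
    t_hat = [
        mean([y for ii, y in observations if ii == i]) - mu_hat
        for i in range(len(T))
    ]
    return mu_hat, t_hat


# Balanced: every treatment appears exactly once in every block.
balanced_cells = [(i, j) for i in range(len(T)) for j in range(len(B))]
mu_balanced, t_balanced = estimate_effects(balanced_cells)

# Unbalanced: treatment 0 sits mostly in the high block, treatment 1 in the low block.
unbalanced_cells = [(0, 0), (0, 2), (0, 2), (1, 0), (1, 0), (1, 2)]
mu_unbalanced, t_unbalanced = estimate_effects(unbalanced_cells)

print("balanced:  ", mu_balanced, t_balanced)
print("unbalanced:", mu_unbalanced, t_unbalanced)
```

In the balanced layout the block effects average out of each \(\bar{Y}_{i.}\), so the estimates recover the true \(T_i\). In the unbalanced layout they do not, and with these numbers the ordering of the two treatments even flips.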

5. sources

Wikipedia page on blocking
very simple example. not a lot of technical depth
Penn State stats
Wikipedia page on pseudoreplication
Wikipedia page on paired difference test
- references the formula \(\operatorname{Var}(Y_2 - Y_1) = \operatorname{Var}(Y_2) + \operatorname{Var}(Y_1) - 2\operatorname{Cov}(Y_2, Y_1)\), which is also referenced in the Wikipedia page for blocking
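
A quick numerical check of the variance identity for a paired difference, as a sketch on simulated data. The shared block effect, the treatment shift of 2.0, and the noise levels are all made-up assumptions, and `sample_cov` is a small hypothetical helper.

```python
import random
import statistics


def sample_cov(xs, ys):
    """Sample covariance with the same n - 1 denominator as statistics.variance."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


rng = random.Random(0)
n = 1000

# Each pair shares a block effect, which makes y1 and y2 positively correlated.
block_effect = [rng.gauss(0.0, 5.0) for _ in range(n)]
y1 = [b + rng.gauss(0.0, 1.0) for b in block_effect]
y2 = [b + 2.0 + rng.gauss(0.0, 1.0) for b in block_effect]
diff = [b - a for a, b in zip(y1, y2)]

lhs = statistics.variance(diff)
rhs = statistics.variance(y1) + statistics.variance(y2) - 2 * sample_cov(y1, y2)
unpaired = statistics.variance(y1) + statistics.variance(y2)

print("Var(Y2 - Y1):                    ", lhs)
print("Var(Y2) + Var(Y1) - 2 Cov(Y2,Y1):", rhs)
print("Var(Y2) + Var(Y1):               ", unpaired)
```

When the covariance is positive, as it is when both measurements share a block, the variance of the difference is smaller than the sum of the two variances.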