Experimental Design & Randomization — Week 03 Notes

Week 03, Lecture 01

  • Learning objectives (from Page 2):

    • Identify the experimental and observational units in a given experiment
    • List and describe the three main principles of experimental design: Randomization, Replication, Blocking
    • Describe the layout and setup of a CRD, RCBD, and a simple factorial experimental design
    • Discuss and critique a given experimental design
    • Identify sources of variation within a given experimental design
  • Key concepts and definitions (Pages 3–7):

    • Experimental unit: the smallest portion of experimental material which is independently perturbed; the item under study for which a treatment is changed (e.g., a human subject or an agricultural plot).
    • Observational unit (subsample): the smallest unit on which a response is measured. If the experimental unit is split after treatment is applied, the resulting subsamples are observational units. If one measurement is made on each experimental unit, observational unit = experimental unit. If multiple measurements are made per subject, this is pseudo- or technical replication.
    • Treatment (factor): an experimental condition independently applied to an experimental unit. Values of the treatments within a set are called levels.
    • Dependent variable / response: the output measured after an experiment; what is observed to assess the effect of changing the treatment(s).
    • Effect: the change in the response variable caused by the controlled changes in the independent variable. The significance or practical importance of the effect is determined by analyses.
  • Observational context and example from Page 3 (Between-group vs within-group variation):

    • Between-group variation: differences across groups (e.g., species); differences to the overall average can be examined.
    • Within-group variation: variability within a single group/species.
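    • Example data notes: the Adelie, Chinstrap, and Gentoo species each have their own species average (the values 48.834, 47.568, and 38.824 are shown). Individual observations scatter around their species average (within-group variation), and the species averages differ from the overall average (between-group variation).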
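The between-group and within-group ideas above can be made concrete with sums of squares. The following is a minimal Python sketch (the course itself uses R); the function name `sum_of_squares` and the measurements are made up for illustration and are not the data from the slides. It shows that the total variation around the overall average splits into a between-group part and a within-group part.

```python
from statistics import mean

def sum_of_squares(groups):
    """Split total variation into between-group and within-group parts.

    groups: dict mapping group name -> list of numeric observations.
    Returns (ss_between, ss_within, ss_total).
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    # Between: how far each group average sits from the overall average.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Within: how far each observation sits from its own group average.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    # Total: how far each observation sits from the overall average.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between, ss_within, ss_total

# Made-up measurements for three groups (illustrative only).
groups = {
    "group_1": [48.0, 50.0, 49.0],
    "group_2": [47.0, 46.0, 48.0],
    "group_3": [39.0, 38.0, 40.0],
}
ss_between, ss_within, ss_total = sum_of_squares(groups)
print(ss_between, ss_within, ss_total)
```

In this toy data most of the variation is between groups, which is the pattern that makes group differences easy to see.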
  • Experimental layout demonstrations (Pages 8–9):

    • Live demo protocol for a Sweet Sorting Experiment (Treatment A: Unwrap then Sort; Treatment B: Sort then Unwrap)
    • Roles within groups: Notetaker, Timekeeper, Participants
    • Block identification by participant time preference (Morning vs Afternoon)
    • Randomized treatment allocation within blocks (using a fair method such as the sample() function in R)
    • Data collection fields: Group name, Participant number, Block (Morning/Afternoon), Treatment (A/B), Time, Observations (optional)
    • Purpose of blocking: to account for nuisance factors (e.g., time-of-day effects) and improve the precision of comparisons
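The randomized allocation within blocks described above could be sketched as follows. This is a Python analogue of what the notes do with sample() in R, not the code used in the lecture; the function name `allocate_within_blocks` and the participant ids are hypothetical. Labels are shuffled separately inside each block so that Morning and Afternoon each receive both treatments.

```python
import random

def allocate_within_blocks(blocks, treatments=("A", "B"), seed=None):
    """Randomly assign treatments to participants separately within each block.

    blocks: dict mapping block name -> list of participant ids.
    Returns a dict mapping participant id -> (block, treatment).
    """
    rng = random.Random(seed)
    allocation = {}
    for block, participants in blocks.items():
        # Cycle through the treatment labels so each block is as balanced as possible.
        labels = [treatments[i % len(treatments)] for i in range(len(participants))]
        rng.shuffle(labels)  # the random step, done independently in every block
        for participant, label in zip(participants, labels):
            allocation[participant] = (block, label)
    return allocation

# Hypothetical participants, grouped by their stated time preference.
blocks = {
    "Morning": ["P1", "P2", "P3", "P4"],
    "Afternoon": ["P5", "P6", "P7", "P8"],
}
allocation = allocate_within_blocks(blocks, seed=1)
for participant, (block, treatment) in allocation.items():
    print(participant, block, treatment)
```

Fixing the seed makes the allocation reproducible, which is useful when the allocation has to be reported alongside the data.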
  • Coffee bean example: an early factorial example (Pages 11–15, 28–31)

    • Experimental setup: three coffee bean types (Arabica, Liberica, Robusta) and 12 identical cups. The cups are randomly allocated to the three treatments (bean types), four cups per bean type. The beans are ground and all cups prepared identically, yielding 12 cups in total. Samples are taken from each cup and coffee strength is measured.
    • Scientific question: Does coffee strength differ between bean types?
    • Experimental unit and observational unit clarification (Pages 13–15):
    • Experimental unit: the coffee cup, as each cup is allocated a different bean type (treatment).
    • Observational unit: depends on the sampling strategy. If a single 1 ml sample is taken per cup, the observational unit coincides with the cup; if several samples are taken from each cup, the observational unit is the individual sample (a subsample of the cup).
    • Output interpretation examples (Page 15):
    • If four 1 ml samples are taken from each cup and a measurement is taken from each sample, then the observational unit is each 1 ml sample (12 cups × 4 samples = 48 measurements). Because the samples are nested within cups, the replication structure is more complex than one measurement per cup.
    • Design concepts introduced:
    • Factorial design allows two or more factors to be studied simultaneously; enables assessment of interaction effects between factors.
    • Balanced design: equal numbers of replicates in each cell; unbalanced design occurs when replicates differ across cells.
    • Interaction example with bean type (Arabica, Liberica, Robusta) and grinder type (Manual, Electric): A.M, A.E, L.M, L.E, R.M, R.E denote combinations of bean type and grinder type.
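The cell labels and the balanced/unbalanced distinction above can be illustrated with a short Python sketch (Python is used here for illustration; the course works in R). The helper `is_balanced` and the two layouts are hypothetical. Each combination of bean type and grinder type is one cell, and a design is balanced when every cell holds the same number of replicates.

```python
from collections import Counter
from itertools import product

bean_types = ["Arabica", "Liberica", "Robusta"]
grinder_types = ["Manual", "Electric"]

# Every combination of factor levels is one cell of the factorial design.
cells = [f"{bean[0]}.{grinder[0]}" for bean, grinder in product(bean_types, grinder_types)]
print(cells)  # ['A.M', 'A.E', 'L.M', 'L.E', 'R.M', 'R.E']

def is_balanced(cell_labels, all_cells):
    """True when every cell has the same non-zero number of replicates."""
    counts = Counter(cell_labels)
    return len({counts[c] for c in all_cells}) == 1 and counts[all_cells[0]] > 0

# Hypothetical layouts: two cups per cell, then the same layout with one cup lost.
balanced_layout = cells * 2
unbalanced_layout = balanced_layout[:-1]
print(is_balanced(balanced_layout, cells))    # True
print(is_balanced(unbalanced_layout, cells))  # False
```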
  • Three key principles of experimental design (Pages 16–23):

    • Replication
    • Biological replication: measure treatment effects on several biological units (humans, animals, or plants) to generalize results to a population.
    • Technical replication: two or more samples from the same biological source processed independently; increases precision when processing steps introduce substantial variation.
    • Pseudo-replication: one biological sample split into multiple aliquots measured independently; may be used to gain precision but does not add genuine biological replication.
    • Randomization
    • Random allocation reduces bias by giving every experimental unit the same chance of receiving each treatment, which tends to balance unknown factors across treatments.
    • Helps ensure that observed differences are attributable to treatments rather than systematic bias; supports analysis assuming independence of observations.
    • Blocking
    • Blocking reduces nuisance variation by grouping similar experimental units into blocks.
    • Within-block units are more alike than between-block units; blocking accounts for nuisance factors (e.g., age class, time of day).
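One practical consequence of the replication definitions above is that subsamples should not be counted as independent replicates. A minimal Python sketch of one way to respect this, assuming made-up coffee strength values and hypothetical cup ids: average the subsamples within each cup, then count replicates in cups rather than in measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cup id, bean type, strength of one 1 ml sample).
records = [
    ("cup1", "Arabica", 5.1), ("cup1", "Arabica", 5.3),
    ("cup2", "Arabica", 4.8), ("cup2", "Arabica", 5.0),
    ("cup3", "Robusta", 6.2), ("cup3", "Robusta", 6.0),
    ("cup4", "Robusta", 6.5), ("cup4", "Robusta", 6.7),
]

# Group the subsample measurements by cup (the experimental unit).
by_cup = defaultdict(list)
bean_of_cup = {}
for cup, bean, strength in records:
    by_cup[cup].append(strength)
    bean_of_cup[cup] = bean

# One value per cup: the mean of its subsamples.
cup_means = {cup: mean(values) for cup, values in by_cup.items()}

# Replication for the bean comparison is counted in cups, not in samples.
replicates = defaultdict(int)
for cup in cup_means:
    replicates[bean_of_cup[cup]] += 1

print(len(records), "measurements")
print(dict(replicates))
```

Here there are eight measurements but only two replicates per bean type; treating the eight values as independent would be pseudo-replication.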
  • Experimental designs (Pages 23–33):

    • Completely randomized design (CRD)
    • One treatment factor with t levels; n experimental units divided randomly into t groups; each group receives one level of the treatment.
    • With a single treatment factor, other independent variables are controlled to avoid bias.
    • Random allocation can be performed by simple methods (e.g., drawing lots) or using R's sample() function.
    • Randomized complete block design (RCBD)
    • One treatment factor with t levels; b blocks; each block contains t experimental units, one per treatment level.
    • Blocking factor (e.g., cup type) is included to control variation due to this nuisance factor.
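    • Within each block, treatments are randomly allocated to units. Units within a block should be as homogeneous as possible, while blocks can differ from one another: the block term absorbs that nuisance variation, and the spread of conditions across blocks lets conclusions apply more broadly.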
    • Factorial design
    • Two or more factors studied simultaneously; all combinations of factor levels are considered.
    • Enables estimation of main effects and interaction effects.
    • Example: bean type (Arabica, Liberica, Robusta) × grinder type (Manual, Electric).
    • Notation: A = bean type factor with levels {Arabica, Liberica, Robusta}; G = grinder type with levels {Manual, Electric}.
    • Interaction: the effect of one factor depends on the level of another factor.
    • Balanced vs unbalanced designs: balanced = equal replicates in each cell; unbalanced = unequal replicates across cells.
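The CRD and RCBD layouts above differ only in where the randomization happens: across all units at once, or separately inside each block. A minimal Python sketch under stated assumptions (the function names `crd` and `rcbd`, the cup ids, and the four cup-type blocks are hypothetical; the notes use R's sample() for the same step):

```python
import random

def crd(units, treatments, seed=None):
    """Completely randomized design: shuffle all units, then deal them into equal groups."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    rng = random.Random(seed)
    shuffled = rng.sample(units, k=len(units))  # random permutation of the units
    size = len(units) // len(treatments)
    return {unit: treatments[i // size] for i, unit in enumerate(shuffled)}

def rcbd(blocks, treatments, seed=None):
    """Randomized complete block design: every block gets each treatment once, in random order."""
    rng = random.Random(seed)
    layout = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs exactly one unit per treatment")
        order = rng.sample(treatments, k=len(treatments))  # fresh shuffle per block
        layout[block] = dict(zip(units, order))
    return layout

beans = ["Arabica", "Liberica", "Robusta"]
cups = [f"cup{i}" for i in range(1, 13)]
crd_layout = crd(cups, beans, seed=42)
print(crd_layout)

# Hypothetical blocks: four cup types, three cups of each type.
blocks = {f"cup_type_{b}": [f"{b}-{j}" for j in range(1, 4)] for b in "WXYZ"}
rcbd_layout = rcbd(blocks, beans, seed=42)
print(rcbd_layout)
```

In the CRD a bean type may by chance land mostly on one kind of cup; the RCBD rules this out by construction, because each cup type sees every bean type exactly once.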
  • Mean coffee strength and interpretation of factorial results (Pages 31–32):

    • In a factorial design, effects are interpreted through combinations of main effects and potential interactions.
    • Visual example (textual representation):
    • Bean type and grinder effects can be analyzed separately, or in combination to reveal interaction terms.
    • Interaction note (Page 32): If an interaction exists, the effect of one factor on the response changes depending on the level of the other factor.
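To see what "the effect of one factor changes depending on the level of the other" means numerically, the grinder effect can be computed separately for each bean type. The Python sketch below uses invented cell means (they are not results from the lecture) and a hypothetical helper `grinder_effect`; it only inspects the means and is not a significance test.

```python
# Hypothetical cell means of coffee strength (illustrative numbers only).
cell_means = {
    ("Arabica", "Manual"): 5.0, ("Arabica", "Electric"): 6.0,
    ("Liberica", "Manual"): 4.0, ("Liberica", "Electric"): 5.0,
    ("Robusta", "Manual"): 6.0, ("Robusta", "Electric"): 9.0,
}

def grinder_effect(bean):
    """Electric minus Manual mean strength for one bean type."""
    return cell_means[(bean, "Electric")] - cell_means[(bean, "Manual")]

effects = {bean: grinder_effect(bean) for bean in ("Arabica", "Liberica", "Robusta")}
print(effects)

# The means show no interaction only if the grinder effect is identical for every bean type.
has_interaction = len(set(effects.values())) > 1
print(has_interaction)
```

With these numbers the electric grinder adds 1 unit for Arabica and Liberica but 3 units for Robusta, so a single "grinder effect" would not describe the data well.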
  • Key points (Page 33):

    • Experimental unit = the item where a treatment is changed.
    • Observational unit = the smallest unit on which a response is measured.
    • Replication, Randomization, Blocking are essential design features to improve inference and reduce bias.
  • Week 03, Lecture 02: Randomization tests and p-values (Pages 35–57):

    • Learning objectives recap:
    • Formulate a question/hypothesis; write null and alternative hypotheses using statistical notation; write R code to conduct a randomization test; interpret a p-value.
    • Example data workflow (Sweets dataset, Page 36):
    • Time (response) distribution by Treatment; violin plot used to visualize differences.
    • Example hypothesis testing (Pages 37–39):
    • Question: Are means of two treatments significantly different?
    • Observed data example: for Treatment A vs Treatment B, the observed difference in sample means is $\Delta_{obs} = \bar{X}_A - \bar{X}_B \approx 1.494$ (seconds).
    • Randomization test framework (Pages 42–44):
    • Step 1: Choose a test statistic. Common choice: the difference in sample means, $\Delta = \bar{X}_A - \bar{X}_B$.
    • Step 2: Compute the observed statistic: Δ_obs = 1.494.
    • Step 3: Randomly reassign treatment labels within the data to generate a reshuffled dataset; compute the test statistic for each reshuffled dataset. Repeat for a large number of iterations (n_reps).
    • Example code pattern (R-like): mutate(random_labels = sample(Treatment, replace = FALSE)) and recompute the diff in means for each rep.
    • Step 4: Build the sampling distribution of the test statistic under the null hypothesis (H0).
    • Step 5: Compute the p-value as the proportion of reshuffled statistics that are as extreme as or more extreme than $\Delta_{obs}$. For a two-sided test: $p = P(|\Delta^*| \ge |\Delta_{obs}|)$.
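Steps 1 to 5 can be put together in a few lines. The sketch below is a Python analogue of the R pattern mentioned in the notes, not the lecture's own code; the function names and the sorting times are made up, and for simplicity the labels are reshuffled across all observations, ignoring any blocking.

```python
import random
from statistics import mean

def diff_in_means(times, labels):
    """Step 1: test statistic, mean of group A minus mean of group B."""
    a = [t for t, g in zip(times, labels) if g == "A"]
    b = [t for t, g in zip(times, labels) if g == "B"]
    return mean(a) - mean(b)

def randomization_test(times, labels, n_reps=1000, seed=None):
    """Two-sided randomization test for a difference in means.

    Returns (observed difference, list of reshuffled differences, p-value).
    """
    rng = random.Random(seed)
    observed = diff_in_means(times, labels)      # Step 2: observed statistic
    shuffled = list(labels)
    null_diffs = []
    for _ in range(n_reps):                      # Step 3: reshuffle and recompute
        rng.shuffle(shuffled)                    # reassign labels without replacement
        null_diffs.append(diff_in_means(times, shuffled))
    # Steps 4-5: null_diffs approximates the distribution under H0;
    # the p-value is the share of reshuffles at least as extreme as observed.
    extreme = sum(abs(d) >= abs(observed) for d in null_diffs)
    return observed, null_diffs, extreme / n_reps

# Made-up sorting times in seconds (not the class data).
times = [30.1, 28.4, 33.0, 29.5, 31.2, 27.9, 26.3, 30.0, 28.8, 27.1]
labels = ["A"] * 5 + ["B"] * 5
observed, null_diffs, p_value = randomization_test(times, labels, n_reps=2000, seed=3)
print(round(observed, 3), p_value)
```

Shuffling a copy of the label vector keeps the group sizes fixed in every reshuffle, which matches sampling the labels without replacement.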
    • Example results (Pages 48–53):
    • Observed difference Δ_obs = 1.494.
    • In 1000 reshuffles, 267 had $\Delta^* \ge |\Delta_{obs}|$ (upper tail) and 315 had $\Delta^* \le -|\Delta_{obs}|$ (lower tail), so 582 reshuffles in total satisfied $|\Delta^*| \ge |\Delta_{obs}|$.
    • P-value (two-sided) ≈ 0.582 (i.e., 58.2% of the time a difference as extreme or more extreme would be observed under random labeling).
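As a worked check of the reported value, assuming the two counts are the upper-tail and lower-tail reshuffles out of 1000, the two-sided p-value is the sum of both tails divided by the number of reshuffles:

```latex
p \;=\; \frac{\#\{\Delta^* \ge |\Delta_{obs}|\} + \#\{\Delta^* \le -|\Delta_{obs}|\}}{n_{reps}}
  \;=\; \frac{267 + 315}{1000}
  \;=\; \frac{582}{1000}
  \;=\; 0.582
```

A value this large means the observed difference of 1.494 seconds is entirely typical of what random relabeling produces.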
    • Interpretation and cautions (Pages 46–49):
    • A p-value is the probability, under the null model, that a statistic would be at least as extreme as the one observed; it does not measure the probability that the studied hypothesis is true.
    • A p-value does not by itself indicate the size of an effect or its practical importance.
    • Smaller p-values indicate stronger evidence against the null, but thresholds (e.g., 0.05) are conventions, not universal cutoffs; multiple testing requires controlling the family-wise error rate (FWER).
    • ASA statement on p-values (Pages 46–49):
    • Emphasizes good statistical practice: design, transparent reporting, context-aware interpretation; no single index should replace scientific reasoning.
    • Type I and Type II errors (Pages 58–65):
    • Type I error (false positive): reject H0 when H0 is true.
      • $\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \text{ given } H_0 \text{ true})$.
    • Type II error (false negative): fail to reject H0 when an alternative is true.
      • $\beta = P(\text{do not reject } H_0 \text{ when } H_1 \text{ is true})$.
    • Power: 1 − β; the probability of correctly rejecting H0 when H1 is true.
    • Significance level and power balance: decreasing α reduces false positives but may reduce power; increasing sample size, increasing effect size, or reducing experimental variance can increase power.
    • Family-Wise Error Rate (FWER): when making multiple comparisons, the probability of at least one Type I error across the family of tests; control approaches aim to keep FWER at a desired level (e.g., α).
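To see why multiple comparisons need attention, the FWER can be computed directly in the special case where the tests are independent and each is run at level α. The Python sketch below makes that independence assumption explicitly; the Bonferroni correction shown is one common control approach and is not named in these notes. Both function names are hypothetical.

```python
def fwer_independent(alpha, m):
    """Probability of at least one Type I error across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test level that keeps the FWER at or below alpha (Bonferroni correction)."""
    return alpha / m

for m in (1, 5, 20):
    # Columns: number of tests, uncorrected FWER, Bonferroni per-test level.
    print(m, round(fwer_independent(0.05, m), 3), bonferroni_alpha(0.05, m))
```

With α = 0.05 and 20 independent tests the uncorrected FWER is about 0.64, so at least one false positive is more likely than not.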
    • Summary statements (Pages 65–67):
    • Emphasizes randomization, exploration of null distributions via simulation, and careful interpretation of p-values within the broader context of study design and data quality.
  • Additional reflections and terminology (Pages 58–66):

    • Reiteration of Type I vs Type II error concepts and their trade-offs.
    • Emphasis on reporting, transparency, and proper interpretation of statistical evidence.
    • The notes remind readers to connect statistical results to the underlying study design, data sources, and practical context.
  • Practical takeaways for exam preparation:

    • Distinguish clearly between experimental units and observational units; know how replication types map to these units.
    • Be able to describe CRD, RCBD, and factorial designs, including when blocking is used and how randomization is applied within blocks.
    • Understand how to interpret main effects and interactions in factorial designs; recognize when an interaction alters the interpretation of main effects.
    • Be able to set up null and alternative hypotheses for two-sample comparisons: $H_0: \mu_A = \mu_B, \quad H_1: \mu_A \neq \mu_B$.
    • Be able to compute and interpret the difference in means $\Delta = \bar{X}_A - \bar{X}_B$ and to conduct a simple randomization test by permuting labels, computing a distribution of $\Delta^*$, and obtaining a p-value from the empirical distribution.
    • Remember key p-value interpretations, and the ASA guidance against over-interpreting p-values as probabilities of hypotheses being true.
    • Be aware of Type I vs Type II errors, statistical power, and how experimental design choices impact these errors (e.g., sample size, effect size, variance, and multiple testing concerns).
  • Quick glossary (condensed):

    • Experimental unit: the unit to which a treatment is applied.
    • Observational unit: the unit on which a response is measured; may be a subsample if there are multiple measurements per experimental unit.
    • Replication: repeating measurements across multiple independent units or samples to generalize findings.
    • Technical replication: multiple measurements from the same unit; helps assess measurement precision.
    • Pseudo-replication: treating non-independent subsamples as independent replicates; can inflate precision estimates.
    • Blocking: arranging experimental units into homogeneous groups to reduce nuisance variation.
    • CRD: Completely Randomized Design.
    • RCBD: Randomized Complete Block Design.
    • Factorial design: experiments with two or more factors studied in all combinations; allows detection of interactions.
    • Interaction: when the effect of one factor depends on the level of another factor.
    • p-value: probability under the null model of obtaining a statistic at least as extreme as the one observed; does not directly measure hypothesis truth or effect size.
    • FWER: Family-Wise Error Rate, the probability of making at least one Type I error among multiple tests.
  • Notation recap (for quick study):

    • $H_0: \mu_A = \mu_B, \quad H_1: \mu_A \neq \mu_B$.
    • $\Delta = \bar{X}_A - \bar{X}_B$.
    • Observed: $\Delta_{obs} = 1.494$.
    • Permutation p-value: $p = \frac{\#\{|\Delta^*| \ge |\Delta_{obs}|\}}{n_{reps}}$.