Data Collection Fundamentals

Introduction to Data Collection

This section, Module 1 - Section 3, focuses on fundamental principles of data collection, as presented by Rosana Fok.
Illustrative Example: Ice-Cream Sales and Advertising
- Scenario: An ice-cream store began advertising in a local newspaper in May.
- Observation: The following month saw a 40% increase in ice-cream sales.
- Conclusion Drawn: "The advertisement was effective in increasing ice-cream sales."
- Significance: The conclusion treats an observed association as proof of causation. Sales rose after the advertisement began, but a single before-and-after comparison cannot establish cause and effect. Without controlled conditions such as random allocation, other factors (e.g., warmer weather, a holiday, local events) could have contributed to the increase, so the causal claim is premature. The example motivates the roles of random sampling and random allocation in drawing reliable inferences.
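
The confounding argument can be made concrete with a toy calculation. The sketch below is purely hypothetical: the names `SALES_PER_DEGREE`, `AD_EFFECT`, and `monthly_sales`, and all the numbers, are assumptions chosen for illustration, not data from the example. It shows that a 40% rise can appear even when the advertisement contributes nothing, simply because the weather got warmer.

```python
# Hypothetical model: sales depend only on temperature.
SALES_PER_DEGREE = 10  # assumed: units sold per degree Celsius
AD_EFFECT = 0          # assumed: the advertisement adds no sales at all


def monthly_sales(temperature_c, advertised):
    """Return sales for a month under the assumed toy model."""
    ad_bonus = AD_EFFECT if advertised else 0
    return SALES_PER_DEGREE * temperature_c + ad_bonus


# Cooler month without advertising vs. warmer month with advertising.
before_sales = monthly_sales(temperature_c=20, advertised=False)
after_sales = monthly_sales(temperature_c=28, advertised=True)

increase = (after_sales - before_sales) / before_sales
print(f"Increase after advertising began: {increase:.0%}")  # prints 40%
```

Because temperature and advertising changed together, the observed data alone cannot separate their effects.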

Key Concepts: Random Sampling and Random Allocation

Random Sampling
- Definition: The process of gathering a sample by randomly selecting individuals or units directly from the entire population of interest.
- Purpose: To obtain a sample that is representative of the larger population on average, thereby minimizing sampling bias and supporting accurate generalization.
- Outcome: Enables Population Inference.
Random Allocation (or Random Assignment)
- Definition: The process of randomly assigning subjects (who are already part of a sample) into different treatment groups or control groups within an experimental study.
- Purpose: To create groups that are comparable, on average, in all respects except the treatment they receive. This minimizes the influence of confounding variables.
- Outcome: Crucial for establishing cause-and-effect relationships (Causal Inference).
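
As a minimal sketch of random allocation, the function below shuffles an existing sample and deals the subjects into groups. The function name `randomly_allocate` and its parameters are illustrative assumptions, and only Python's standard `random` module is used.

```python
import random


def randomly_allocate(subjects, n_groups=2, seed=None):
    """Shuffle the subjects and deal them into n_groups groups."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # Deal round-robin so group sizes differ by at most one.
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Hypothetical sample of 10 subject IDs split into treatment and control.
treatment, control = randomly_allocate(range(1, 11), n_groups=2, seed=42)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Chance alone decides each subject's group, so any pre-existing differences between subjects tend to balance out across the groups.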

Two Types of Inferences

Statistical Inference: The process of using data from a sample to draw conclusions or make predictions about a larger population or to determine a cause-and-effect relationship. The type of inference possible depends heavily on the study design's use of random sampling and/or random allocation.
Population Inference
- Definition: Making a conclusion about the characteristics or parameters of the entire population based on the data collected from a sample. This allows for the generalization of findings from a sample to the broader population from which it was drawn.
- Condition for Validity: Can only be made reliably if the sample was obtained using Random Sampling. Without random sampling, the sample may not accurately represent the population, leading to biased or invalid generalizations.
Causal Inference
- Definition: Making a conclusion about a cause-and-effect relationship between variables, specifically that a change in one variable (the cause) directly leads to a change in another variable (the effect).
- Condition for Validity: Can only be made reliably if the subjects in an experiment were assigned to treatment groups using Random Allocation. This helps control for confounding variables and ensures that the groups are comparable, allowing researchers to isolate the effect of the treatment.
Summary of Inference Capabilities Based on Study Design:
- If Random Sampling is YES and Random Allocation is YES: Both Population Inference and Causal Inference are possible. This is the strongest study design (e.g., a randomized controlled trial with a representative sample), allowing for generalizable causal conclusions.
- If Random Sampling is YES and Random Allocation is NO: Only Population Inference is possible. Results can be generalized to the population, but definitive cause-and-effect relationships cannot be established due to potential confounding factors (e.g., an observational study with a representative sample).
- If Random Sampling is NO and Random Allocation is YES: Only Causal Inference is possible within the specific sample studied. You can establish cause-and-effect among the participants in your study, but you cannot generalize these findings to a larger population because the sample itself was not randomly selected and may not be representative.
- If Random Sampling is NO and Random Allocation is NO: Neither Population Inference nor Causal Inference can be reliably made. Conclusions are limited to the observed sample without generalizability or strong causal claims.
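
The four cases above reduce to two independent yes/no questions. A small sketch (the function name `possible_inferences` is an assumption made for illustration) encodes the summary:

```python
def possible_inferences(random_sampling, random_allocation):
    """Return the inference types a study design supports."""
    inferences = []
    if random_sampling:
        inferences.append("population inference")
    if random_allocation:
        inferences.append("causal inference")
    return inferences


# Enumerate all four designs from the summary.
for sampling in (True, False):
    for allocation in (True, False):
        result = possible_inferences(sampling, allocation) or ["neither"]
        print(f"sampling={sampling}, allocation={allocation}: {', '.join(result)}")
```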

Different Types of Random Samples

These are various methods used to implement random sampling, each with its own advantages depending on the population structure and research goals.
SRS (Simple Random Sample)
- Description: Every possible sample of a given size from the population has an equal chance of being selected. This is the most basic form of random sampling.
- Method: Typically involves assigning a unique numerical identifier to each member of the population and then using a random number generator to select the sample.
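
A minimal sketch of this method, assuming the population is available as a list of identifiers (the function name `simple_random_sample` is hypothetical). Python's standard `random.Random.sample` draws without replacement, so every subset of size `n` is equally likely.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)


# Hypothetical population: units numbered 1 to 500.
sample = simple_random_sample(range(1, 501), n=20, seed=7)
print(sorted(sample))
```
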
Systematic Random Sample
- Description: Subjects are selected at regular, predetermined intervals from an ordered list of the population.
- Method: Determine the sampling interval $k$ by dividing the population size by the desired sample size, $k = \frac{\text{Population Size}}{\text{Sample Size}}$. Choose a random starting point among the first $k$ individuals, then select every $k$-th individual thereafter.
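
The procedure can be sketched as follows. The function name `systematic_sample` is hypothetical, and rounding $k$ down when the population size is not an exact multiple of the sample size is an assumption of this sketch, since the text does not say how to handle that case.

```python
import random


def systematic_sample(ordered_population, n, seed=None):
    """Pick a random start among the first k units, then every k-th unit."""
    units = list(ordered_population)
    k = len(units) // n  # sampling interval (rounded down: an assumption)
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index within the first k units
    return units[start::k][:n]  # trim any extra unit caused by rounding


# Hypothetical ordered list of 100 units, sample of 10, so k = 10.
print(systematic_sample(range(1, 101), n=10, seed=3))
```
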
Stratified Random Sample
- Description: The population is first divided into distinct, non-overlapping subgroups (strata) based on a shared characteristic (e.g., age groups, gender, socioeconomic status). A simple random sample is then drawn independently from each stratum.
- Purpose: Ensures that every relevant subgroup is represented in the total sample (proportionally, when each stratum's sample size matches its share of the population), which can improve the precision of overall population estimates and allow for subgroup analysis.
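
A minimal sketch of stratified sampling with proportional allocation, where the same sampling fraction is applied within every stratum. The function name `stratified_sample`, the `stratum_of` callback, and the example data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)  # group units by stratum
    sample = {}
    for name, members in strata.items():
        n = max(1, round(fraction * len(members)))  # at least one per stratum
        sample[name] = rng.sample(members, n)
    return sample


# Hypothetical population: (id, age_group) pairs, 60 "under 40" and 40 "40+".
people = [(i, "under 40") for i in range(60)] + [(i, "40+") for i in range(60, 100)]
result = stratified_sample(people, stratum_of=lambda p: p[1], fraction=0.1, seed=5)
print({name: len(units) for name, units in result.items()})
```
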
Cluster Random Sample
- Description: The population is divided into naturally occurring groups or clusters (e.g., geographical areas, schools, hospitals). A random sample of these clusters is selected, and then all individuals within the chosen clusters are included in the overall sample.
- Purpose: Often used when it is impractical, too costly, or impossible to create a sampling frame of individual units or to sample individuals spread across a wide area. It is more efficient for large populations, but it may yield less precise estimates than SRS or stratified sampling when clusters are not internally diverse, that is, when individuals within a cluster resemble one another.
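
A minimal sketch of the cluster sampling described above, in which whole clusters are drawn at random and every individual in a chosen cluster is kept. The function name `cluster_sample` and the school data are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose n_clusters clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical clusters: schools mapped to their student IDs.
schools = {
    "School A": [1, 2, 3],
    "School B": [4, 5],
    "School C": [6, 7, 8, 9],
    "School D": [10, 11],
}
print(cluster_sample(schools, n_clusters=2, seed=11))
```

Randomness enters only at the cluster level here; no further sampling happens inside the selected clusters.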