Data Basics

Types of Variables

  • Numerical (Quantitative)

    • Continuous: Can take any value within a range (e.g., height, weight).

    • Discrete: Can only take distinct values (e.g., number of students in a class).

  • Categorical

    • Regular (Nominal): No inherent order (e.g., colors).

    • Ordinal: Meaningful order but no consistent difference between categories (e.g., Likert scale responses).


Relationships Among Variables

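  • Associated (Dependent) Variables: Show some connection; knowing the value of one tells you something about the other.

  • Independent Variables: No connection between variables; knowing the value of one tells you nothing about the other.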

  • Scatterplots & Correlations: Used to analyze positive, negative, or no associations between variables.


Observational Studies & Sampling Strategies

Populations & Samples

  • Population: Entire group of interest (described by parameters).

  • Sample: Subset of the population used for the study (described by statistics).

Sampling Bias / Poor Sampling Designs

  • Non-response Bias: When only a small fraction of sampled individuals respond, so the respondents may not represent the population.

  • Voluntary Response Bias: When individuals self-select, often those with strong opinions.

  • Convenience Sampling: Sampling individuals that are easily accessible.

Types of Sampling

  • Simple Random Sample: Every individual has an equal chance of selection, and every possible sample of the same size is equally likely.

  • Stratified Sampling: Population divided into subgroups (strata) with random samples from each.

  • Cluster Sampling: Randomly selecting entire groups (clusters) rather than individuals.

  • Multistage Sampling: Sampling in stages using different methods.
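
The four sampling designs above can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the classroom population, the function names, and the sample sizes are all made up for the example, and only the standard library `random` module is used.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n individuals so that every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample of n_per_stratum from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every individual in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: randomly pick clusters. Stage 2: sample individuals within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 4 classrooms of 5 students each.
classrooms = {f"room{r}": [f"room{r}-s{i}" for i in range(5)] for r in range(4)}
everyone = [s for members in classrooms.values() for s in members]

srs = simple_random_sample(everyone, 6, rng)
strat = stratified_sample(classrooms, 2, rng)    # classrooms treated as strata
clus = cluster_sample(classrooms, 2, rng)        # classrooms treated as clusters
multi = multistage_sample(classrooms, 2, 3, rng)
```

Note that the same grouping (classrooms) serves as strata in one design and clusters in another: stratified sampling draws from every group, while cluster sampling keeps only some groups and takes everyone in them.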


Experiments

Principles of Experimental Design

  1. Control: Compare treatment and control groups.

  2. Randomize: Randomly assign subjects to treatments.

  3. Replicate: Collect a sufficiently large sample, or repeat the entire study to verify the findings.

  4. Block: Group subjects based on known variables that affect the response.
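
Randomization and blocking can be combined by randomizing separately inside each block. The sketch below assumes a hypothetical set of subjects tagged with a risk level (the blocking variable) and splits each block evenly between treatment and control; the names `blocked_assignment` and `block_of` are illustrative only.

```python
import random
from collections import defaultdict

def blocked_assignment(subjects, block_of, rng):
    """Within each block, shuffle subjects and split them between two groups."""
    blocks = defaultdict(list)
    for s in subjects:
        blocks[block_of(s)].append(s)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]          # copy so the input is not modified
        rng.shuffle(shuffled)          # random order within the block
        half = len(shuffled) // 2
        for s in shuffled[:half]:
            assignment[s] = "treatment"
        for s in shuffled[half:]:
            assignment[s] = "control"
    return assignment

# Hypothetical subjects: (id, risk level). Risk level is the blocking variable.
subjects = [(i, "high" if i < 6 else "low") for i in range(14)]
assignment = blocked_assignment(subjects, block_of=lambda s: s[1],
                                rng=random.Random(1))
```

Because the split happens within each block, both groups end up with the same mix of high-risk and low-risk subjects, so risk level cannot explain a difference between them.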

Other Key Concepts

  • Placebo Effect: Psychological response to a non-active treatment.

  • Blinding: Prevents bias by concealing treatment assignment.


Ethics in Data Collection

  • Institutional Review Boards (IRBs): Protect study subjects by reviewing research plans.

  • Informed Consent: Participants must be fully informed and provide consent.

  • Confidentiality vs. Anonymity:

    • Confidentiality: Identities are collected but kept separate from the data and not disclosed.

    • Anonymity: Identities are not collected.


Examining Variables

  • Distribution of a Variable: Describes values a variable takes and their frequency.

  • Graphical Tools:

    • Categorical Variables: Pie charts, bar graphs.

    • Quantitative Variables: Histograms, stemplots, boxplots.


Summarizing Data

Categorical Variables

  • Frequency Distribution Table: Shows counts & percentages.

  • Bar Plot: Represents frequencies or proportions; can be stacked/side-by-side.

  • Contingency Table: Summarizes data for two categorical variables.
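
A frequency table and a contingency table are both just counts, so they can be built with `collections.Counter`. The survey records below are invented for illustration, and the helper names are not from any library.

```python
from collections import Counter

# Hypothetical survey records: (class year, owns a bike).
records = [("first", "yes"), ("first", "no"), ("second", "yes"),
           ("first", "yes"), ("second", "no"), ("second", "no")]

def frequency_table(values):
    """Map each category to (count, percentage of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: (count, 100 * count / total)
            for level, count in counts.items()}

def contingency_table(pairs):
    """Cross-tabulate two categorical variables as nested dicts of counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

freq = frequency_table([year for year, _ in records])
table = contingency_table(records)

for row, cells in table.items():
    print(row, cells)
```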

Quantitative Variables

  • Measures of Center:

    • Mean: Average of the dataset.

    • Median: The middle value of ranked data.

  • Measures of Spread:

    • Variance: Average squared deviation from the mean (the sample variance divides by n - 1 rather than n).

    • Standard Deviation: Square root of variance.

    • IQR (Interquartile Range): Q3 - Q1.

  • Stemplots & Histograms:

    • Show data distribution and density.

    • Shape Descriptions: Unimodal, bimodal, symmetric, skewed.

    • Bin Width: Affects visualization of data distribution.

  • Box Plots (Box-and-Whisker Plots):

    • Visualize the median, quartiles, IQR, and outliers.

    • Useful for comparing multiple groups.

  • Robust Statistics:

    • For skewed distributions: Median & IQR are better measures than mean & standard deviation.
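
The measures of center and spread above can be computed with Python's standard `statistics` module. The data set is hypothetical and deliberately right-skewed. Two assumptions to note: quartile conventions differ between textbooks and software (this uses the module's default "exclusive" method), and the 1.5 × IQR rule for flagging outliers is a common box plot convention rather than something fixed by these notes.

```python
import statistics

# Hypothetical right-skewed data: one value is far larger than the rest.
data = [2, 3, 3, 4, 5, 6, 7, 40]

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
sd = statistics.stdev(data)            # square root of the sample variance

# Quartiles; other conventions can give slightly different values.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Common box plot convention (assumed here): flag points beyond 1.5 * IQR.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Robustness check: drop the extreme value and see how far each center moves.
trimmed = data[:-1]
mean_shift = mean - statistics.mean(trimmed)
median_shift = median - statistics.median(trimmed)

print(f"mean={mean}, median={median}, sd={sd:.2f}, IQR={iqr}")
print(f"outliers={outliers}")
print(f"mean shift={mean_shift:.2f}, median shift={median_shift:.2f}")
```

Removing the single extreme value moves the mean much more than the median, which is why the median and IQR are preferred for skewed data.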


Correlation & Line of Best Fit

  • Scatterplots: Visualize relationships between two numerical variables.

  • Correlation Coefficient (r):

    • Measures strength & direction of linear relationships (-1 to +1).

  • Least Squares Regression Line:

    • Minimizes squared residuals; used for prediction.

  • Residuals:

    • Difference between observed & predicted values.

    • Residual Plots: Assess model fit.

  • R² (Coefficient of Determination):

    • Proportion of variability explained by the model.

  • Slope & Intercept:

    • Slope: Expected change in response variable per unit change in explanatory variable.

    • Intercept: Expected value of the response variable when the explanatory variable = 0.

  • Prediction & Extrapolation:

    • Predicting within the data range is generally reliable when the linear model fits well.

    • Extrapolation (beyond the data range) is often unreliable.
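
The quantities in this section fit together in a short computation. The sketch below uses the standard least squares formulas (slope = Sxy / Sxx, intercept = ȳ - slope · x̄) on five made-up points; the function names `fit_line` and `predict` are illustrative, not from a library.

```python
import math

def fit_line(x, y):
    """Least squares slope and intercept, plus the correlation coefficient r."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def predict(x_new, slope, intercept, x_min, x_max):
    """Return the prediction and a flag that is True when extrapolating."""
    return intercept + slope * x_new, not (x_min <= x_new <= x_max)

# Hypothetical data: explanatory variable x, response variable y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r = fit_line(x, y)

# Residual = observed - predicted.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# R^2 = 1 - SSE / SST; for a simple linear fit this equals r squared.
y_bar = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

inside, inside_flag = predict(3, slope, intercept, min(x), max(x))
outside, outside_flag = predict(10, slope, intercept, min(x), max(x))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}, R^2={r_squared:.2f}")
print(f"x=10 is extrapolation: {outside_flag}")
```

The residuals of a least squares line with an intercept always sum to zero, which makes a quick sanity check on the fit. A residual plot would graph `residuals` against `x` and look for leftover patterns.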
