Data and Variables

Module 1 - Section 2: Data and Variables

Introduction

  • Presenter: Rosana Fok

Key Concepts

Population vs. Sample
  • Population: The entire collection of individuals or items of interest in a statistical study.

  • Sample: A subset of the population selected for study in a specified manner.

  • Parameter: A numerical summary that describes a characteristic of the population; often unknown.

  • Statistic: A numerical summary that describes a characteristic of the sample; known once data are observed.

  • Notation:

    • Greek letters denote parameters, e.g., mean $\mu$, standard deviation $\sigma$.

    • Latin letters denote statistics, e.g., mean $\bar{y}$, standard deviation $s$.
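For reference, the two statistics named in the notation list are usually computed as shown here. This is a sketch under the standard textbook definitions, assuming a sample of $n$ observations $y_1, \dots, y_n$; the symbols $n$ and $y_i$ are introduced for illustration and do not appear in the original notes.

$$
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
$$

The parameters $\mu$ and $\sigma$ are the corresponding quantities for the whole population, which is why they are usually unknown.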

Commonly Used Parameters and Statistics
  • Summary of key statistics:

    • Mean:

      • Statistic: $\bar{y}$

      • Parameter: $\mu$ (mu)

    • Standard Deviation:

      • Statistic: $s$

      • Parameter: $\sigma$ (sigma)

    • Correlation:

      • Statistic: $r$

      • Parameter: $\rho$ (rho)

    • Regression Coefficient:

      • Statistic: $b$

      • Parameter: $\beta$ (beta)

    • Proportion:

      • Statistic: $\hat{p}$ (p-hat)

      • Parameter: $p$
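To make the statistic side of this summary concrete, here is a minimal Python sketch that computes a sample mean, sample standard deviation, and sample proportion. The data values and the "more than 5 hours" cutoff are made up for illustration; only the standard-library `statistics` module is used.

```python
import statistics

# Hypothetical sample: weekly study hours reported by 8 students (made-up values).
sample = [4.0, 6.5, 5.0, 7.5, 3.0, 6.0, 5.5, 4.5]

n = len(sample)
y_bar = statistics.mean(sample)   # sample mean, an estimate of mu
s = statistics.stdev(sample)      # sample standard deviation (n - 1 divisor), an estimate of sigma

# Sample proportion: share of students who study more than 5 hours (assumed cutoff).
p_hat = sum(1 for y in sample if y > 5.0) / n

print(f"n = {n}, y_bar = {y_bar:.3f}, s = {s:.3f}, p_hat = {p_hat:.3f}")
```

Each printed value is a statistic: it is known once the sample is observed, while the matching parameter stays unknown unless the whole population is measured.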

Problem of Data Collection
  • It is generally impossible to gather all observations from a population.

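  • Typical workflow: draw a sample from the population (sampling), summarize the sample with statistics (descriptive statistics), then use those statistics to draw conclusions about the population parameters (inference).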

Practical Examples

Estimating Proportions and Means
  1. Example 1: Proportion of male students in a Statistics class.

    • Population of Interest: All students in the Statistics class.

    • Sample: 20 randomly selected students.

    • Parameter: Proportion of male students in the class.

    • Statistic: Proportion of male students in the sample.

  2. Example 2: Average study time of female students in a Statistics class.

    • Population of Interest: All female students in the Statistics class.

    • Sample: 20 randomly selected female students.

    • Parameter: Average study time of female students.

    • Statistic: Average study time of the selected 20 female students.
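The distinction in Example 1 can be sketched in Python. The class size (120) and the number of male students (54) are invented for illustration; in a real study the parameter would not be known in advance, which is exactly why a sample is taken.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical class roster: 120 students, 54 of them male (made-up numbers).
population = ["M"] * 54 + ["F"] * 66

# Parameter: proportion of male students in the whole class.
p = population.count("M") / len(population)

# Statistic: proportion of male students in a simple random sample of 20.
sample = random.sample(population, k=20)
p_hat = sample.count("M") / len(sample)

print(f"parameter p     = {p:.3f}")
print(f"statistic p_hat = {p_hat:.3f}")
```

Changing the seed draws a different sample and usually a different `p_hat`, while `p` stays fixed. The statistic varies from sample to sample; the parameter does not.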

Understanding Data

Definition of Data
  • Data can be numbers, record names, or other labels. They are meaningful only in context, which is established by answering:

    • 5 W's: Who, What, When, Where, Why

    • 1 H: How

Who: Subject Identification
  • Definition: The individuals or subjects from whom data is collected.

    • Subjects: Individuals in an experiment.

    • Population: Entire set of subjects.

    • Sample: Observed subset of subjects.

    • Respondents: Individuals who respond to surveys.

    • Experimental Units: Animals, plants, or inanimate objects on which an experiment is performed.

What: Measuring Characteristics
  • Definition: The characteristics or variables being measured.

  • Variables: Characteristics recorded about each individual, which can take different values.

  • Examples of measurement units (SI base units):

    • Distance: Meter

    • Mass: Kilogram

    • Time: Second

    • Electric Current: Ampere

    • Temperature: Kelvin

    • Amount of Substance: Mole

Why: Purpose of Data Collection
  • Importance: The purpose of a study determines which analysis methods are appropriate and how each variable is treated (for example, as categorical or quantitative).

  • Key reminder: The Who, What, and Why are the minimum context needed to analyze data.

When and Where: Context of Data Collection
  • Knowing when and where data is collected provides contextual information that influences interpretation.

  • Examples:

    • Comparing the average price of a pen at the University of Alberta in the 1960s with its price today.

    • Comparing salaries in Canada with salaries in developing countries.

How: Data Collection Methodology
  • Definition: How the data were collected. The method used can significantly affect the results.

  • Importance: Proper data gathering is crucial for valid analysis and conclusions.

  • Potential issues with data collection:

    • Improper methods lead to incorrect conclusions.

    • Example: Voluntary internet surveys often yield unreliable results.
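One way to see why voluntary surveys can mislead is a small simulation. This is a hedged sketch, not a model of any real survey: the population size, the 30% dissatisfaction rate, and the response probabilities are all assumptions chosen only to show the mechanism.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 customers, 30% dissatisfied (made-up numbers).
population = [True] * 3000 + [False] * 7000  # True = dissatisfied
true_share = sum(population) / len(population)

# Method 1: simple random sample of 500 customers.
srs = random.sample(population, k=500)
srs_share = sum(srs) / len(srs)

# Method 2: voluntary response. Assumption: dissatisfied customers are
# five times as likely to answer an open internet survey.
RESPONSE_PROB = {True: 0.5, False: 0.1}
volunteers = [x for x in population if random.random() < RESPONSE_PROB[x]]
voluntary_share = sum(volunteers) / len(volunteers)

print(f"true share dissatisfied:   {true_share:.3f}")
print(f"random sample estimate:    {srs_share:.3f}")
print(f"voluntary survey estimate: {voluntary_share:.3f}")
```

Under these assumptions the random sample lands near the true share, while the voluntary survey overstates dissatisfaction because the people who choose to respond differ systematically from those who do not.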

Case Study: Monitoring the Future (MTF) Project

  • Purpose: Study changes in beliefs, attitudes, and behaviors of young Americans.

    • Who: 8th, 10th, and 12th graders.

    • What: Alcohol, illegal drug, and cigarette use.

    • Why: To study changes in beliefs, attitudes, and behaviors.

    • When: Spring 2004.

    • Where: United States.

    • How: Surveys administered to randomly selected students.

Types of Variables

Definition of Variables
  • Variables are characteristics recorded about each individual.

Types of Variables
  1. Categorical (Qualitative) Variables:

    • Define categories or groups within subjects.

    • Count cases in each category.

    • Types:

      • Nominal: Levels without a specific order (e.g., hair color).

      • Ordinal: Levels with a specific order (e.g., educational levels).

    • Example Variables:

      • Gender (Male, Female)

      • Hair color (blonde, brown, black, etc.)

      • Grade (A+, A, A-, etc.)

      • Car manufacturer (Ford, Honda, etc.)

  2. Numerical (Quantitative) Variables:

    • Measure a numerical quantity in each subject.

    • Types:

      • Discrete: Distinct values only (e.g., count of siblings).

      • Continuous: Any value in an interval (e.g., height).

    • Clarification: Not all data represented by numbers are numerical (e.g., coding schemes like 1 = male, 2 = female).
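The clarification about coding schemes can be illustrated with a short Python sketch. The list of coded responses is made up; the 1 = male, 2 = female scheme is the one mentioned above.

```python
from collections import Counter
import statistics

# Hypothetical coded responses using the scheme 1 = male, 2 = female.
gender_codes = [1, 2, 2, 1, 2, 2, 1, 2]
labels = {1: "male", 2: "female"}

# Arithmetic on the codes runs without error, but the result has no meaning:
# there is no such thing as an "average gender" of 1.625.
meaningless_mean = statistics.mean(gender_codes)

# Counting the cases in each category is the appropriate summary.
counts = Counter(labels[code] for code in gender_codes)

print(f"mean of codes (not meaningful): {meaningless_mean}")
print(f"counts per category: {dict(counts)}")
```

The software cannot tell that the numbers are labels, so deciding whether a variable is categorical or numerical is the analyst's job.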

Example of Variable Classification

  • Given the following variables from a medical study:

    • Age (years) – Quantitative

    • Smoker (yes or no) – Categorical

    • Systolic blood pressure (mmHg) – Quantitative

    • Level of calcium in the blood (micrograms/mL) – Quantitative

    • Drug effectiveness (scale of 1-5) – Categorical (ordinal)

Data Table Example

  • Each row displays one student's individual characteristics:

    • Attributes include: Age, Student ID, Height, Gender, Grade.

    • Context provided:

      • Who (row titles) and What (column titles) are indicated in the data table.

    • Note: Student ID is an identifier variable. It is categorical and has no quantitative units.

  Student    Age  Student ID  Height  Gender  Grade
  Alice      19   00001       165     F       B+
  Boris      20   00002       170     M       A-
  Catherine  18   00003       158     F       A
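The same table can be written as records in Python, which makes the Who (one record per student) and the What (one key per variable) explicit. This is a minimal sketch: the `variable_types` mapping is one reading of the notes (Gender as nominal and Grade as ordinal are labels added here), and Student ID is stored as a string so that it is never treated as a quantity.

```python
# The three rows of the data table; Student ID is kept as a string
# so its leading zeros survive and it cannot be averaged by accident.
students = [
    {"Student": "Alice", "Age": 19, "Student ID": "00001", "Height": 165, "Gender": "F", "Grade": "B+"},
    {"Student": "Boris", "Age": 20, "Student ID": "00002", "Height": 170, "Gender": "M", "Grade": "A-"},
    {"Student": "Catherine", "Age": 18, "Student ID": "00003", "Height": 158, "Gender": "F", "Grade": "A"},
]

# Illustrative classification of each column (an assumption, not part of the table).
variable_types = {
    "Age": "quantitative",
    "Student ID": "identifier",
    "Height": "quantitative",
    "Gender": "categorical (nominal)",
    "Grade": "categorical (ordinal)",
}

# Means are computed only for the quantitative variables.
quantitative = [name for name, kind in variable_types.items() if kind == "quantitative"]
means = {
    name: sum(row[name] for row in students) / len(students)
    for name in quantitative
}

for name, value in means.items():
    print(f"mean {name}: {value:.2f}")
```

Averaging is restricted to Age and Height; the identifier and the categorical variables would instead be summarized by counts.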

Conclusion

  • Acknowledgment of the audience for their attention and participation in the discussion on data and variables.