Data and Variables
Module 1 - Section 2: Data and Variables
Introduction
Presenter: Rosana Fok
Key Concepts
Population vs. Sample
Population: The entire collection of individuals or items of interest in a statistical study.
Sample: A subset of the population selected for study in a specified manner.
Parameter: A numerical summary that describes a characteristic of the population; often unknown.
Statistic: A numerical summary that describes a characteristic of the sample; known once data are observed.
Notation:
Greek letters denote parameters, e.g., mean $\mu$, standard deviation $\sigma$.
Latin letters denote statistics, e.g., mean $\bar{y}$, standard deviation $s$.
Commonly Used Parameters and Statistics
Summary of key statistics:
Mean:
Statistic: $\bar{y}$
Parameter: $\mu$ (mu)
Standard Deviation:
Statistic: $s$
Parameter: $\sigma$ (sigma)
Correlation:
Statistic: $r$
Parameter: $\rho$ (rho)
Regression Coefficient:
Statistic: $b$
Parameter: $\beta$ (beta)
Proportion:
Statistic: $\hat{p}$
Parameter: $p$
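As a minimal sketch of how some of these statistics are computed once data are observed, the snippet below uses Python's standard `statistics` module on a small made-up sample. The data values and the variable names (`study_hours`, `is_male`, `y_bar`, `p_hat`) are illustrative assumptions, not part of the lecture.

```python
import statistics

# Hypothetical sample: weekly study hours for 8 students (made-up values).
study_hours = [4.0, 6.5, 5.0, 7.5, 3.0, 6.0, 5.5, 4.5]
# Hypothetical sample: whether each of those students is male (made-up values).
is_male = [True, False, False, True, True, False, False, True]

y_bar = statistics.mean(study_hours)   # sample mean, estimates the population mean
s = statistics.stdev(study_hours)      # sample standard deviation (n - 1 divisor)
p_hat = sum(is_male) / len(is_male)    # sample proportion, estimates the population proportion

print(f"y_bar = {y_bar:.3f}, s = {s:.3f}, p_hat = {p_hat:.3f}")
```

Each value is a statistic: it is known exactly once the sample is in hand, while the corresponding parameter of the full population generally remains unknown.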
Problem of Data Collection
It is generally impossible to gather all observations from a population.
Workflow: Population (parameter) -> Sampling -> Sample data -> Descriptive statistics -> Inference about the parameter
Practical Examples
Estimating Proportions and Means
Example 1: Proportion of male students in a Statistics class.
Population of Interest: All students in the Statistics Class.
Sample: 20 randomly selected students.
Parameter: Proportion of male students in the class.
Statistic: Proportion of male students in the sample.
Example 2: Average study time of female students in Statistics.
Population of Interest: All female students in Statistics class.
Sample: 20 randomly selected female students.
Parameter: Average study time of female students.
Statistic: Average study time of the selected 20 female students.
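Example 1 can be sketched in Python under stated assumptions. The class size (120) and the number of male students (54) are invented for illustration; in practice the parameter would not be known, which is exactly why a sample is taken.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical Statistics class: 120 students, 54 of them male (assumed numbers).
population = ["M"] * 54 + ["F"] * 66

# Parameter: computed from the whole population (usually unknown in practice).
p = population.count("M") / len(population)

# Statistic: computed from a simple random sample of 20 students.
sample = rng.sample(population, 20)
p_hat = sample.count("M") / len(sample)

print(f"parameter p = {p:.3f}, statistic p_hat = {p_hat:.3f}")
```

Changing the seed produces a different sample and usually a different statistic, while the parameter stays fixed.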
Understanding Data
Definition of Data
Data can be numbers, record names, or other labels. Context is provided by addressing:
5 W's: Who, What, When, Where, Why
1 H: How
Who: Subject Identification
Definition: The individuals or subjects from whom data is collected.
Subjects: Individuals in an experiment.
Population: Entire set of subjects.
Sample: Observed subset of subjects.
Respondents: Individuals who respond to surveys.
Experimental Units: Includes animals, plants, and inanimate objects.
What: Measuring Characteristics
Definition: The characteristics or variables being measured.
Variables: Characteristics recorded about each individual, which can take different values.
Measurement Units Examples:
Distance: Meter
Mass: Kilogram
Time: Second
Electric Current: Ampere
Temperature: Kelvin
Amount of Substance: Mole
Why: Purpose of Data Collection
Importance: Understanding why data is collected shapes the analysis method and variable treatment.
Key reminder: Need the Who, What, and Why to analyze data.
When and Where: Context of Data Collection
Knowing when and where data is collected provides contextual information that influences interpretation.
Example:
Comparing the average price of a pen at the University of Alberta in the 1960s with its price today.
Salary comparisons between Canada and third-world countries.
How: Data Collection Methodology
Definition: The methods used for data collection can significantly impact results.
Importance: Proper data gathering is crucial for valid analysis and conclusions.
Potential issues with data collection:
Improper methods lead to incorrect conclusions.
Example: Voluntary internet surveys often yield unreliable results.
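One way to see why voluntary surveys can mislead is a small simulation. This is a sketch under invented assumptions: a population of 10,000 in which 30% hold a strong opinion, and response rates of 50% for people with the opinion versus 10% for everyone else. None of these numbers come from the lecture.

```python
import random

rng = random.Random(1)  # fixed seed for repeatability

# Hypothetical population: 3,000 of 10,000 people hold a strong opinion (True).
population = [True] * 3000 + [False] * 7000
true_p = sum(population) / len(population)

# Random sample: every person is equally likely to be selected.
random_sample = rng.sample(population, 500)
p_random = sum(random_sample) / len(random_sample)

# Voluntary response: assume people with the opinion are far more likely
# to answer (50% vs. 10%); these response rates are made up.
volunteers = [x for x in population if rng.random() < (0.5 if x else 0.1)]
p_voluntary = sum(volunteers) / len(volunteers)

print(f"true = {true_p:.2f}, random = {p_random:.2f}, voluntary = {p_voluntary:.2f}")
```

Under these assumptions the voluntary estimate lands far above the true proportion even though it is based on more responses than the random sample, which suggests that the collection method matters more than the sheer number of responses.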
Case Study: Monitoring the Future (MTF) Project
Purpose: Study changes in beliefs, attitudes, and behaviors of young Americans.
Who: 8th, 10th, and 12th graders.
What: Alcohol, illegal drug, and cigarette use.
Why: To study changes in beliefs, attitudes, and behaviors.
When: Spring 2004.
Where: United States.
How: Surveys administered to randomly selected students.
Types of Variables
Definition of Variables
Variables are characteristics recorded about each individual.
Types of Variables
Categorical (Qualitative) Variables:
Define categories or groups within subjects.
Count cases in each category.
Types:
Nominal: Levels without a specific order (e.g., hair color).
Ordinal: Levels with a specific order (e.g., educational levels).
Example Variables:
Gender (Male, Female)
Hair color (blonde, brown, black, etc.)
Grade (A+, A, A-, etc.)
Car manufacturer (Ford, Honda, etc.)
Numerical (Quantitative) Variables:
Measure a numerical quantity in each subject.
Types:
Discrete: Distinct values only (e.g., count of siblings).
Continuous: Any value in an interval (e.g., height).
Clarification: Not all data represented by numbers are numerical (e.g., coding schemes like 1 = male, 2 = female).
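The coding-scheme caveat can be illustrated with a short Python sketch. The list of coded responses is made up; the point is only that arithmetic on category codes runs without error yet produces a number with no interpretation.

```python
from collections import Counter
import statistics

# Hypothetical coded responses using the scheme 1 = male, 2 = female.
codes = [1, 2, 2, 1, 2, 2, 1, 2]
labels = {1: "male", 2: "female"}

# Arithmetic runs without error, but the result has no meaning:
# there is no such thing as an "average gender" of 1.625.
meaningless_mean = statistics.mean(codes)

# The appropriate summary for a categorical variable is a count per category.
counts = Counter(labels[c] for c in codes)

print(meaningless_mean)
print(counts)
```

Mapping the codes back to labels before summarizing makes it harder to treat them as quantities by accident.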
Example of Variable Classification
Given the following variables from a medical study:
Age (years) → Quantitative
Smoker (yes or no) → Categorical
Systolic blood pressure (mmHg) → Quantitative
Level of calcium in the blood (micrograms/mL) → Quantitative
Drug effectiveness (scale of 1-5) → Categorical
Data Table Example
Each student's individual characteristics displayed:
Attributes Include: Age, Student ID, Height, Gender, Grade.
Context Provided:
Who (row titles) and What (column titles) indicated in the data table.
Note: Student ID is an identifier variable, categorical, with no quantitative units.
| Student | Age | Student ID | Height | Gender | Grade |
|---|---|---|---|---|---|
| Alice | 19 | 00001 | 165 | F | B+ |
| Boris | 20 | 00002 | 170 | M | A- |
| Catherine | 18 | 00003 | 158 | F | A |
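The table can be represented in Python as a list of records, one per student (the Who), with one key per variable (the What). This is only a sketch: the key names are chosen here for illustration, and storing Student ID as a string is a design choice that reflects its role as an identifier.

```python
from collections import Counter
import statistics

# The three rows from the table. Student ID is stored as a string because it
# is an identifier: leading zeros matter, and arithmetic on it does not.
students = [
    {"name": "Alice", "age": 19, "student_id": "00001", "height": 165, "gender": "F", "grade": "B+"},
    {"name": "Boris", "age": 20, "student_id": "00002", "height": 170, "gender": "M", "grade": "A-"},
    {"name": "Catherine", "age": 18, "student_id": "00003", "height": 158, "gender": "F", "grade": "A"},
]

# Quantitative variables: numerical summaries are meaningful.
mean_age = statistics.mean(s["age"] for s in students)
mean_height = statistics.mean(s["height"] for s in students)

# Categorical variables: count the cases in each category.
gender_counts = Counter(s["gender"] for s in students)
grade_counts = Counter(s["grade"] for s in students)

print(mean_age, round(mean_height, 1), dict(gender_counts), dict(grade_counts))
```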
Conclusion
Thank you for your attention and participation in this discussion of data and variables.