Notes on Statistics Basics: Data, Population, Sample, Variables, and Levels of Measurement
Population, Sample, Data, and Variables
- Data are the values that a variable takes. For example, the years 2019 and 2020 are data values of the variable time.
- A variable is a characteristic that can assume different values.
- Population vs Sample:
- Population: all subjects in the study.
- Sample: a group or subset drawn from the population.
- Individual: a single person or subject that is a member of the sample.
- In short:
- Data = values of a variable.
- Variable = a characteristic that can take different values.
- Population = all subjects.
- Sample = subset of the population.
Descriptive vs Inferential Statistics
- Descriptive statistics: organization and summarization of data.
- Examples: conduct a survey, present results in a data table, graph the data to observe trends.
- Calculate the sample mean:
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- Example (illustrative): If you have four ages, age1, age2, age3, and age4, add them and divide by the number of observations (here, 4) to get the mean.
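The sample mean formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `sample_mean` and the four ages are assumptions made up for the example.

```python
def sample_mean(values):
    """Return the sum of the observations divided by how many there are."""
    if not values:
        raise ValueError("sample_mean requires at least one observation")
    return sum(values) / len(values)


# Hypothetical ages, invented for illustration only (not from the notes).
ages = [21, 24, 30, 25]

# (21 + 24 + 30 + 25) / 4 = 100 / 4
x_bar = sample_mean(ages)
print(x_bar)  # 25.0
```

The empty-list check is a design choice for the sketch: dividing by n = 0 is undefined, so the function refuses rather than returning a misleading value.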
- Inferential statistics (the transcript sometimes says "influential statistics"; the standard term is inferential): use information from a sample to draw conclusions about the population.
- Rationale: sampling the whole population is often costly or impractical, yet we want to infer about the population (e.g., disease spread) from a sample.
- Concept: make a decision or draw a conclusion about a population parameter from sample data.
- Probability, hypothesis testing, and decision making are core tools in inferential statistics:
- Probability = likelihood of an event occurring.
- Hypothesis testing = a technique for deciding, based on sample data, whether to reject or fail to reject a claim about the population.
- The goal is to draw meaningful information about the population from the sample.
Parameter vs Statistic; Population vs Sample (Applied Examples)
- Population parameters vs sample statistics:
- A parameter describes a population property (e.g., population mean, population proportion).
- A statistic describes a sample property (e.g., sample mean, sample proportion).
- How to identify population vs sample:
- If you see the word "all" or an indication that the data come from the entire population, it refers to the parameter level.
- If you see data from a randomly selected group, it refers to the statistic level.
- Examples from a disease context (from the transcript):
- Hypothetical hospital example:
- The maximum length of stay among all hospital patients: a parameter (population).
- The average length of stay among a randomly selected group: a statistic (sample).
- The recovery rate from a random group: a statistic (sample).
- Another example: train data at Opera House Station:
- Records for all trains last year: the population, so its values are parameters (e.g., average delay = 12.7 minutes; 17.1% delayed; 15 trains canceled).
- A random sample of 50 records audited by Ivana: the sample, so its values are statistics (e.g., average delay = 21.3 minutes; 32% delayed; 2 trains canceled).
- Quick takeaway: the word "all" indicates population parameters; a randomly selected group indicates sample statistics.
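The parameter/statistic distinction can be made concrete with a small simulation. The sketch below is an assumption-laden illustration: the population of 1,000 delays is randomly generated, not the Opera House Station data, and the names `population_delays` and `sample_delays` are hypothetical.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: delay in minutes for every train in a made-up year.
population_delays = [random.uniform(0, 30) for _ in range(1000)]

# Parameter: computed from ALL records in the population.
population_mean_delay = mean(population_delays)

# Statistic: computed from a randomly selected subset (here, 50 records).
sample_delays = random.sample(population_delays, 50)
sample_mean_delay = mean(sample_delays)

print(f"parameter (population mean delay): {population_mean_delay:.1f} min")
print(f"statistic (sample mean delay):     {sample_mean_delay:.1f} min")
```

The two printed values will generally differ, which mirrors the notes: a sample of 50 audited records gave a different average delay than the full set of records.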
Types of Variables: Qualitative vs Quantitative
- Variables can be classified as:
- Qualitative (categorical): describes categories or groups.
- Quantitative (numerical): describes numerical values (can be counted or measured).
- Qualitative vs Quantitative:
- Qualitative: grouped into categories or levels; examples include gender, level of education, satisfaction level, religious preference, geographical location, ZIP code, nationality.
- Quantitative: numerical and can be measured or counted; examples include distance, temperature, price, age, time, weight, height.
- Subtypes within qualitative and quantitative:
- Qualitative can be ordered (ordinal) or not ordered (nominal).
- Quantitative can be discrete (counted) or continuous (measured).
Qualitative vs Quantitative: Examples
- Qualitative examples (and their subtypes):
- Gender: male, female (nominal).
- ZIP code: unique identifiers (nominal).
- Nationality: qualitative, often nominal.
- Level of education: Master’s, PhD, etc. (qualitative; ordinal when the degrees are ranked).
- Degree of satisfaction: scale (e.g., satisfied, neutral, dissatisfied) (ordinal).
- Quantitative examples:
- Distance from home to nearest store: quantitative, measured (continuous; ratio level, since a distance of zero means no distance).
- Temperature: quantitative (continuous; interval level).
- Price: quantitative (continuous; ratio level).
- Weight, height, age: quantitative (ratio level when the zero point means an absence of the quantity).
- Time: quantitative (continuous; ratio level for durations, interval level for clock or calendar time).
Discrete vs Continuous Variables (within Quantitative)
- Discrete quantitative variables:
- Countable values (usually integers): e.g., number of family members, number of students in a class.
- Continuous quantitative variables:
- Measured values that can take on an infinite number of values between any two values: e.g., height, weight, time, temperature (depending on scale).
- Guiding principle:
- If the values are counted, the variable is typically discrete; if they are measured and can take fractional values, it is continuous.
- Examples given in the transcript:
- Arm length of a heavyweight boxer: continuous (measurable length).
- Height: continuous.
- Age and time: continuous.
- Number of errors (count): discrete (countable).
Levels of Measurement for Qualitative and Quantitative Variables
- For qualitative variables (nominal and ordinal):
- Nominal: lowest level; no natural ordering; categories are distinct with no ranking (e.g., gender, ZIP code, nationality).
- Ordinal: qualitative with natural ordering (e.g., letter grades A, B, C, D or customer satisfaction scales like strongly disagree to strongly agree).
- For quantitative variables (interval and ratio):
- Interval: differences between values are meaningful; there is no true zero; zero does not indicate absence of quantity (e.g., Celsius temperature, years of birth in some contexts).
- Ratio: differences and ratios are meaningful; there is a true zero (e.g., weight at birth, height, age, distance, price); this is the only level with a meaningful zero and meaningful ratios (e.g., 8 pounds is twice 4 pounds).
- Summary of levels by type:
- Qualitative: nominal, ordinal.
- Quantitative: interval, ratio.
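One way to see why ratios are meaningful only at the ratio level is to change units and check whether the ratio survives. The sketch below uses the 8-pound and 4-pound weights from the notes; the temperatures 20 °C and 10 °C are hypothetical values chosen for illustration.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (international avoirdupois pound)


def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, the scale whose zero is a true zero."""
    return c + 273.15


# Ratio level: the weight ratio is the same in pounds and in kilograms.
ratio_lb = 8 / 4
ratio_kg = (8 * LB_TO_KG) / (4 * LB_TO_KG)

# Interval level: the Celsius "ratio" changes once the arbitrary zero is removed.
ratio_c = 20 / 10
ratio_k = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Differences, by contrast, are the same on both temperature scales.
diff_c = 20 - 10
diff_k = celsius_to_kelvin(20) - celsius_to_kelvin(10)

print(ratio_lb, ratio_kg)  # both equal 2: "twice as heavy" holds in any unit
print(ratio_c, ratio_k)    # 2.0 versus roughly 1.035: "twice as hot" does not hold
print(diff_c, diff_k)      # both approximately 10
```

So 8 pounds really is twice 4 pounds, while 20 °C is not "twice as hot" as 10 °C; only the 10-degree difference carries over between scales.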
Quick Classification Exercises (from Transcript)
- Example: Closing price
- Variable type: quantitative (numerical).
- Level of measurement: ratio (closing price has a true zero and ratios are meaningful).
- Example: Marital status (single or married)
- Variable type: qualitative (categorical).
- Level of measurement: nominal (no natural ordering between single and married).
- Example: Temperature
- Variable type: quantitative.
- Level of measurement: interval (differences are meaningful but zero does not represent absence of temperature).
- Example: Distance
- Variable type: quantitative.
- Level of measurement: ratio (has meaningful zero).
- Example: Price (stock price, etc.)
- Variable type: quantitative.
- Level of measurement: ratio.
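The exercise answers above can be collected into a small lookup table. This is only a sketch: the `Level` enum, the `variable_type` helper, and the `CLASSIFICATION` dictionary are hypothetical names, and the temperature entry assumes a Celsius or Fahrenheit scale as in the notes.

```python
from enum import Enum


class Level(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


def variable_type(level):
    """Nominal and ordinal are qualitative; interval and ratio are quantitative."""
    if level in (Level.NOMINAL, Level.ORDINAL):
        return "qualitative"
    return "quantitative"


# Answers taken from the classification exercises in the notes.
CLASSIFICATION = {
    "closing price": Level.RATIO,
    "marital status": Level.NOMINAL,
    "temperature": Level.INTERVAL,  # assumes Celsius or Fahrenheit
    "distance": Level.RATIO,
    "price": Level.RATIO,
}

for name, level in CLASSIFICATION.items():
    print(f"{name}: {variable_type(level)}, {level.value}")
```

Storing only the level and deriving the variable type from it reflects the summary in the notes: the level of measurement already determines whether a variable is qualitative or quantitative.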
Worked Practice: Quick Classification from a Transcript Section
- Quick exercise from Alex (population vs sample, parameter vs statistic) – general patterns:
- If all items are described as a complete population (e.g., all trains, all subscribers), the measurement relates to a parameter (population value).
- If a random sample is described (e.g., a random subset of trains or subscribers), the measurement relates to a statistic (sample value).
- Sample classification examples in the text:
- Population: all trains scheduled to depart last year; parameter example: average delay = 12.7 minutes; proportion delayed = 17.1%; total canceled = 15.
- Sample: audited 50 records; sample average delay = 21.3 minutes; 32% delayed; 2 canceled.
- Population: all subscribers = 10,985; parameters: 75% liked at least one blog; average comments per blog = 2.4; 4,135 liked all blogs.
- Sample: 400 subscribers polled; 68% liked at least one; average comments per blog = 3.1; 149 liked all blogs (these values are sample statistics).
- Key takeaway: distinguish population vs sample by whether data refer to all subjects or to a subset; distinguish parameter vs statistic by whether the value refers to the population or to a sample.
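As a worked check on the subscriber example, the counts can be turned into proportions. The sketch below assumes that 149 and 4,135 are the numbers of subscribers who liked all blogs in the sample and in the population respectively, as the homework dataset states; the function name `proportion` is hypothetical.

```python
def proportion(count, total):
    """Return the fraction of `total` represented by `count`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Statistic: share of the 400 polled subscribers who liked all blogs.
sample_prop = proportion(149, 400)

# Parameter: share of all 10,985 subscribers who liked all blogs.
population_prop = proportion(4135, 10985)

print(f"sample proportion (statistic):     {sample_prop:.4f}")
print(f"population proportion (parameter): {population_prop:.4f}")
```

Both values come out a little above 0.37. The same calculation yields a statistic or a parameter depending only on whether the counts describe the polled subset or all subscribers.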
Practice Questions and Homework (from Transcript)
- Homework discussion:
- Two-question set described: identify population vs sample for described datasets, and identify whether the described values refer to a parameter or a statistic.
- Example dataset: 400 subscribers polled; 68% liked at least one blog; average comments per blog = 3.1; 149 liked all blogs; population data available: total subscribers = 10,985; 75% liked at least one; average comments per blog = 2.4; 4,135 liked all blogs.
- The instructor categorized items as:
- Population vs sample: population corresponds to all, sample to the subset polled.
- Parameter vs statistic: describe which pieces refer to population (parameter) vs sample (statistic).
- The second dataset (operational data about trains) followed the same logic as described above.
Additional Notes from the Session
The instructor discussed the format of the course material and assignments:
- Worksheets exist (Zero, One, and Week 1) and should be completed by the due dates.
- Homework assignments include the two described problems plus related worksheets.
The session was recorded (Zoom) and may be published later.
Quick recap of key concepts:
- Data, variable, population, sample, and individual distinctions.
- Descriptive vs inferential statistics.
- Probability and hypothesis testing as tools for inference.
- Parameter vs statistic.
- Qualitative vs quantitative variables; discrete vs continuous within quantitative.
- Levels of measurement: nominal, ordinal (qualitative); interval, ratio (quantitative).
- Examples help anchor the classifications and levels.
Formulas to remember:
- Sample mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- The concepts of population vs sample and the use of terms parameter vs statistic apply across examples.
Ethical and practical implications:
- Inference from a sample to a population relies on sample representativeness; sampling costs are often justified by the information gained about the population.
- Clear labeling of parameter vs statistic helps avoid confusion when interpreting results.
Real-world relevance:
- The same framework applies to disease spread, energy usage, transportation efficiency, consumer surveys, and market data.