Notes on Statistics Basics: Data, Population, Sample, Variables, and Levels of Measurement

Population, Sample, Data, and Variables

A variable is a characteristic that can assume different values.
Data are the values a variable takes. For example, the years 2019 and 2020 are data values of the variable time.
Population vs Sample:
- Population: all subjects in the study.
- Sample: a group or subset drawn from the population.
- Individual: a single person or subject; each member of a sample is an individual.
In short:
- Data = values of a variable.
- Variable = a characteristic that can take different values.
- Population = all subjects.
- Sample = subset of the population.

Descriptive vs Inferential Statistics

Descriptive statistics: organization and summarization of data.
- Examples: conduct a survey, present results in a data table, graph the data to observe trends.
- Calculate the sample mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
- Example (illustrative): If you have ages like age1, age2, age3, age4, you add them and divide by the number of observations (here, 4) to get the mean.
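The sample mean calculation above can be sketched in a few lines of Python. The four ages are hypothetical values chosen for illustration (the notes give only placeholders age1 to age4), and `sample_mean` is an illustrative helper name, not something from the session.

```python
from statistics import mean


def sample_mean(values):
    """Return the sum of the observations divided by their count n."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


# Hypothetical ages standing in for age1, age2, age3, age4.
ages = [21, 25, 30, 24]

x_bar = sample_mean(ages)
print(x_bar)       # (21 + 25 + 30 + 24) / 4 = 25.0
print(mean(ages))  # the standard library gives the same value
```

Here n = 4, so the sum 100 is divided by 4.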
Inferential statistics (the transcript sometimes renders this as "influential statistics," which appears to be a transcription error): use information from a sample to draw conclusions about the population.
- Rationale: sampling the whole population is often costly or impractical, yet we want to infer about the population (e.g., disease spread) from a sample.
- Concept: make a decision or draw a conclusion about a population parameter from sample data.
Probability, hypothesis testing, and decision making are core tools in inferential statistics:
- Probability = likelihood of an event occurring.
- Hypothesis testing = a technique for deciding, based on sample data, whether to reject a claim about the population or fail to reject it.
- The goal is to draw meaningful information about the population from the sample.

Parameter vs Statistic; Population vs Sample (Applied Examples)

Population parameters vs sample statistics:
- A parameter describes a population property (e.g., population mean, population proportion).
- A statistic describes a sample property (e.g., sample mean, sample proportion).
How to identify population vs sample:
- If you see the word "all" or another indication that the data come from the entire population, the value is a parameter.
- If the data come from a randomly selected group, the value is a statistic.
Examples from a disease context (from the transcript):
- Hypothetical hospital example:
  - The maximum length of stay among all hospital patients: a parameter (population).
  - The average length of stay among a randomly selected group: a statistic (sample).
  - The recovery rate from a random group: a statistic (sample).
Another example: train records at Opera House Station:
- Records for all trains last year: the population (e.g., average delay = 12.7 minutes; 17.1% delayed; 15 trains canceled).
- A random sample of 50 records audited by Ivana: the sample (e.g., average delay = 21.3 minutes; 32% delayed; 2 trains canceled).
Quick takeaway: the word "all" indicates population parameters; a randomly selected group indicates sample statistics.
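The parameter vs statistic distinction can be illustrated with a small simulation. This is a sketch under stated assumptions: the delay values are randomly generated (they are not the Opera House Station records), and the population size, the distribution, and the seed are arbitrary choices. Only the sample size of 50 mirrors the example above.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: one delay (in minutes) for every train in a year.
population = [max(0.0, random.gauss(12, 8)) for _ in range(10_000)]

# Parameter: computed from ALL records in the population.
parameter = mean(population)

# Statistic: computed from a random sample of 50 records.
sample = random.sample(population, 50)
statistic = mean(sample)

print(f"population mean delay (parameter): {parameter:.1f} minutes")
print(f"sample mean delay (statistic):     {statistic:.1f} minutes")
```

The two values usually differ, because the statistic depends on which 50 records happen to be drawn, while the parameter is fixed for the population.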

Types of Variables: Qualitative vs Quantitative

Variables can be classified as:
- Qualitative (categorical): describes categories or groups.
- Quantitative (numerical): describes numerical values (can be counted or measured).
Qualitative vs Quantitative:
- Qualitative: grouped into categories or levels; examples include gender, level of education, satisfaction level, religious preference, geographical location, ZIP code, nationality.
- Quantitative: numerical and can be measured or counted; examples include distance, temperature, price, age, time, weight, height.
Subtypes within qualitative and quantitative:
- Qualitative can be ordered (ordinal) or not ordered (nominal).
- Quantitative can be discrete (counted) or continuous (measured).

Qualitative vs Quantitative: Examples

Qualitative examples (and their subtypes):
- Gender: male, female (nominal).
- ZIP code: unique identifiers (nominal).
- Nationality: qualitative, often nominal.
- Level of education: Master’s, PhD, etc. (qualitative; ordinal when the levels are ranked).
- Degree of satisfaction: scale (e.g., satisfied, neutral, dissatisfied) (ordinal).
Quantitative examples:
- Distance from home to nearest store: quantitative, can be measured (continuous; ratio level, since zero means no distance).
- Temperature: quantitative (continuous; interval level).
- Price: quantitative (continuous; ratio level).
- Weight, height, age: quantitative (ratio level when the zero point means an absence of the quantity).
- Time: quantitative (continuous; ratio level for elapsed time, interval level for clock or calendar time).

Discrete vs Continuous Variables (within Quantitative)

Discrete quantitative variables:
- Countable values (usually integers): e.g., number of family members, number of students in a class.
Continuous quantitative variables:
- Measured values that can take on an infinite number of values between any two values: e.g., height, weight, time, temperature.
Guiding principle:
- If you can count the values, it is typically discrete; if you can measure and there can be fractional values, it is continuous.
Examples given in the transcript:
- Arm length of a heavyweight boxer: continuous (measurable length).
- Height: continuous.
- Age and time: continuous.
- Number of errors (count): discrete (countable).

Levels of Measurement for Qualitative and Quantitative Variables

For qualitative variables (nominal and ordinal):
- Nominal: lowest level; no natural ordering; categories are distinct with no ranking (e.g., gender, ZIP code, nationality).
- Ordinal: qualitative with natural ordering (e.g., letter grades A, B, C, D or customer satisfaction scales like strongly disagree to strongly agree).
For quantitative variables (interval and ratio):
- Interval: differences between values are meaningful; there is no true zero, so zero does not indicate absence of the quantity (e.g., Celsius temperature, calendar years such as year of birth).
- Ratio: differences and ratios are meaningful; there is a true zero (e.g., weight at birth, height, age, distance, price); this is the only level with a meaningful zero and meaningful ratios (e.g., 8 pounds is twice 4 pounds).
Summary of levels by type:
- Qualitative: nominal, ordinal.
- Quantitative: interval, ratio.
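The four levels can be encoded as an ordered enumeration, which makes the summary above easy to check. This is an illustrative sketch: the names `Level`, `variable_type`, and `meaningful_operations` are assumptions made for this example, and the cumulative list of operations reflects the usual textbook presentation rather than anything stated verbatim in the session.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels of measurement, ordered from lowest to highest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def variable_type(level):
    """Nominal and ordinal are qualitative; interval and ratio are quantitative."""
    return "qualitative" if level <= Level.ORDINAL else "quantitative"


def meaningful_operations(level):
    """Each level keeps the operations of the levels below it and adds one."""
    ops = ["equality"]
    if level >= Level.ORDINAL:
        ops.append("ordering")
    if level >= Level.INTERVAL:
        ops.append("differences")
    if level >= Level.RATIO:
        ops.append("ratios")
    return ops


# Examples taken from the notes above.
examples = {
    "ZIP code": Level.NOMINAL,
    "letter grade": Level.ORDINAL,
    "Celsius temperature": Level.INTERVAL,
    "weight at birth": Level.RATIO,
}

for name, level in examples.items():
    print(name, "|", variable_type(level), "|", level.name.lower(),
          "|", ", ".join(meaningful_operations(level)))
```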

Quick Classification Exercises (from Transcript)

Example: Closing price
- Variable type: quantitative (numerical).
- Level of measurement: ratio (closing price has a true zero and ratios are meaningful).
Example: Marital status (single or married)
- Variable type: qualitative (categorical).
- Level of measurement: nominal (no natural ordering between single and married).
Example: Temperature
- Variable type: quantitative.
- Level of measurement: interval (differences are meaningful but zero does not represent absence of temperature).
Example: Distance
- Variable type: quantitative.
- Level of measurement: ratio (has meaningful zero).
Example: Price (stock price, etc.)
- Variable type: quantitative.
- Level of measurement: ratio.
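The interval vs ratio distinction in the temperature exercise can be checked numerically. The two temperatures below are hypothetical values chosen for illustration. The sketch converts them to Kelvin, a scale whose zero is a true zero, to show that the difference survives a change of scale while the apparent "twice as hot" ratio in Celsius does not.

```python
import math


def celsius_to_kelvin(celsius):
    """Shift to the Kelvin scale, whose zero is a true zero."""
    return celsius + 273.15


# Hypothetical readings in degrees Celsius.
cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

diff_c = warm_c - cool_c   # 10 degrees
diff_k = warm_k - cool_k   # still 10 degrees (up to floating-point rounding)

ratio_c = warm_c / cool_c  # 2.0, but only because of where Celsius puts zero
ratio_k = warm_k / cool_k  # about 1.035

print(math.isclose(diff_c, diff_k))  # True: differences are meaningful
print(ratio_c, round(ratio_k, 3))    # 2.0 1.035: the Celsius ratio is not
```

For a ratio-level variable such as price, changing units (dollars to cents, say) multiplies every value by the same factor and leaves zero in place, so ratios are preserved.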

Worked Practice: Quick Classification from a Transcript Section

Quick exercise from Alex (population vs sample, parameter vs statistic) – general patterns:
- If all items are described as a complete population (e.g., all trains, all subscribers), the measurement relates to a parameter (population value).
- If a random sample is described (e.g., a random subset of trains or subscribers), the measurement relates to a statistic (sample value).
Sample classification examples in the text:
- Population: all trains scheduled to depart last year; parameter example: average delay = 12.7 minutes; proportion delayed = 17.1%; total canceled = 15.
- Sample: audited 50 records; sample average delay = 21.3 minutes; 32% delayed; 2 canceled.
- Population: all subscribers = 10,985; parameters: 75% liked at least one blog; average comments per blog = 2.4; 4,135 liked all blogs.
- Sample: 400 subscribers polled; 68% liked at least one; average comments per blog = 3.1; 149 liked all blogs (these values are sample statistics).
Key takeaway: distinguish population vs sample by whether data refer to all subjects or to a subset; distinguish parameter vs statistic by whether the value refers to the population or to a sample.

Practice Questions and Homework (from Transcript)

Homework discussion:
- Two-question set described: identify population vs sample for described datasets, and identify whether the described values refer to a parameter or a statistic.
- Example dataset:
  - Sample: 400 subscribers polled; 68% liked at least one blog; average comments per blog = 3.1; 149 liked all blogs.
  - Population: 10,985 total subscribers; 75% liked at least one blog; average comments per blog = 2.4; 4,135 liked all blogs.
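The counts in this dataset can be turned into proportions to compare a sample statistic with the matching population parameter. The sketch below uses only the "liked all blogs" counts given above (149 of 400 polled; 4,135 of 10,985 subscribers). The helper name `proportion` is an assumption made for this example.

```python
def proportion(count, total):
    """Fraction of the group that has the characteristic."""
    return count / total


# Statistic: computed from the 400 subscribers who were polled.
sample_liked_all = proportion(149, 400)

# Parameter: computed from all 10,985 subscribers.
population_liked_all = proportion(4_135, 10_985)

print(f"sample proportion (statistic):     {sample_liked_all:.2%}")      # 37.25%
print(f"population proportion (parameter): {population_liked_all:.2%}")  # 37.64%
```

The same count-over-total calculation yields a statistic or a parameter depending only on whether the group is the sample or the whole population.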
The instructor categorized items as:
- Population vs sample: population corresponds to all, sample to the subset polled.
- Parameter vs statistic: describe which pieces refer to population (parameter) vs sample (statistic).
The second dataset (operational data about trains) followed the same logic as described above.

Additional Notes from the Session

The instructor discussed the format of the course material and assignments:
- Worksheets exist (Zero, One, and Week 1) and should be completed by the due dates.
- Homework assignments include the two described problems plus related worksheets.
The session was recorded (Zoom) and may be published later.
Quick recap of key concepts:
- Data, variable, population, sample, and individual distinctions.
- Descriptive vs inferential statistics.
- Probability and hypothesis testing as tools for inference.
- Parameter vs statistic.
- Qualitative vs quantitative variables; discrete vs continuous within quantitative.
- Levels of measurement: nominal, ordinal (qualitative); interval, ratio (quantitative).
- Examples help anchor the classifications and levels.
Formulas to remember:
- Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
- The concepts of population vs sample and the use of terms parameter vs statistic apply across examples.
Ethical and practical implications:
- Inference from a sample to a population relies on sample representativeness; sampling costs are often justified by the information gained about the population.
- Clear labeling of parameter vs statistic helps avoid confusion when interpreting results.
Real-world relevance:
- The same framework applies to disease spread, energy usage, transportation efficiency, consumer surveys, and market data.