Notes on Statistics Basics: Data, Population, Sample, Variables, and Levels of Measurement

Population, Sample, Data, and Variables

  • Data are the values that a variable takes. For example, the years 2019 and 2020 are data values of the variable "time".
  • A variable is a characteristic that can assume different values.
  • Population vs Sample:
    • Population: all subjects in the study.
    • Sample: a group or subset drawn from the population.
    • Individual: a single person or subject; each member of the population (and therefore of any sample drawn from it) is an individual.
  • In short:
    • Data = values of a variable.
    • Variable = a characteristic that can take different values.
    • Population = all subjects.
    • Sample = subset of the population.

Descriptive vs Inferential Statistics

  • Descriptive statistics: organization and summarization of data.
    • Examples: conduct a survey, present results in a data table, graph the data to observe trends.
    • Calculate the sample mean:
      \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
    • Example (illustrative): If you have ages like age1, age2, age3, age4, you add them and divide by the number of observations (here, 4) to get the mean.
  • Inferential statistics (the transcript sometimes uses the term "influential statistics"): use information from a sample to draw conclusions about the population.
    • Rationale: sampling the whole population is often costly or impractical, yet we want to infer about the population (e.g., disease spread) from a sample.
    • Concept: make a decision or draw a conclusion about a population parameter from sample data.
  • Probability, hypothesis testing, and decision making are core tools in inferential statistics:
    • Probability = likelihood of an event occurring.
    • Hypothesis testing = a procedure that uses sample data to decide whether to reject or fail to reject a claim (hypothesis) about the population.
    • The goal is to draw meaningful information about the population from the sample.
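
As a minimal sketch of the sample-mean formula from the descriptive-statistics bullets above: the four ages below are made-up illustrative values (not data from the session), and sample_mean is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical ages for four individuals (illustrative values only).
ages = [21, 25, 30, 24]


def sample_mean(values):
    """Add the observations and divide by the number of observations n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / n


x_bar = sample_mean(ages)      # (21 + 25 + 30 + 24) / 4
print(x_bar)                   # 25.0
print(mean(ages) == x_bar)     # cross-check against the standard library
```

The manual version mirrors the formula term by term; in practice statistics.mean does the same job.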

Parameter vs Statistic; Population vs Sample (Applied Examples)

  • Population parameters vs sample statistics:
    • A parameter describes a population property (e.g., population mean, population proportion).
    • A statistic describes a sample property (e.g., sample mean, sample proportion).
  • How to identify population vs sample:
    • If you see the word "all" or another indication that the data come from the entire population, the value is a parameter.
    • If the data come from a randomly selected group, the value is a statistic.
  • Examples from a disease context (from the transcript), using a hypothetical hospital:
    • The maximum length of stay among all hospital patients: a parameter (population).
    • The average length of stay among a randomly selected group: a statistic (sample).
    • The recovery rate from a random group: a statistic (sample).
  • Another example: trains data at Opera House Station:
    • Records for all trains last year: the population (e.g., average delay = 12.7 minutes; 17.1% delayed; 15 trains canceled).
    • A random sample of 50 records audited by Ivana: sample (e.g., average delay = 21.3 minutes; 32% delayed; 2 trains canceled).
  • Quick takeaway: the word "all" indicates population parameters; a randomly selected group indicates sample statistics.
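
The parameter/statistic distinction above can be sketched with simulated data. Everything in this block is an assumption for illustration: the delays are randomly generated, not the Opera House Station records, and the population size of 5000 is arbitrary. Only the sample size of 50 mirrors the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: delay in minutes for every train in a year.
# Simulated values, NOT the records described in the notes.
population_delays = [max(0.0, random.gauss(12.0, 8.0)) for _ in range(5000)]

# Parameter: computed from ALL records in the population.
population_mean = sum(population_delays) / len(population_delays)

# Statistic: computed from a random sample of 50 records.
sample = random.sample(population_delays, 50)
sample_mean = sum(sample) / len(sample)

# A second sample gives its own statistic; the parameter stays fixed.
another_sample = random.sample(population_delays, 50)
another_sample_mean = sum(another_sample) / len(another_sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample 1 mean):   {sample_mean:.1f}")
print(f"statistic (sample 2 mean):   {another_sample_mean:.1f}")
```

The point of drawing two samples is that a statistic varies from sample to sample, while the parameter is a single fixed value for the population.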

Types of Variables: Qualitative vs Quantitative

  • Variables can be classified as:
    • Qualitative (categorical): describes categories or groups.
    • Quantitative (numerical): describes numerical values (can be counted or measured).
  • Qualitative vs Quantitative:
    • Qualitative: grouped into categories or levels; examples include gender, level of education, satisfaction level, religious preference, geographical location, ZIP code, nationality.
    • Quantitative: numerical and can be measured or counted; examples include distance, temperature, price, age, time, weight, height.
  • Subtypes within qualitative and quantitative:
    • Qualitative can be ordered (ordinal) or not ordered (nominal).
    • Quantitative can be discrete (counted) or continuous (measured).

Qualitative vs Quantitative: Examples

  • Qualitative examples (and their subtypes):
    • Gender: male, female (nominal).
    • ZIP code: unique identifiers (nominal).
    • Nationality: qualitative, often nominal.
    • Level of education: Master’s, PhD, etc. (ordinal; the levels have a natural order).
    • Degree of satisfaction: scale (e.g., satisfied, neutral, dissatisfied) (ordinal).
  • Quantitative examples:
    • Distance from home to nearest store: quantitative (continuous; ratio level, since a distance of zero means no distance).
    • Temperature: quantitative (continuous; interval level).
    • Price: quantitative (continuous; ratio level).
    • Weight, height, age: quantitative (ratio level when the zero point means an absence of the quantity).
    • Time: quantitative (continuous; elapsed time is ratio level, while clock time and calendar dates are interval level).

Discrete vs Continuous Variables (within Quantitative)

  • Discrete quantitative variables:
    • Countable values (usually integers): e.g., number of family members, number of students in a class.
  • Continuous quantitative variables:
    • Measured values that can take on an infinite number of values between any two values: e.g., height, weight, time, temperature.
  • Guiding principle:
    • If you can count the values, it is typically discrete; if you can measure and there can be fractional values, it is continuous.
  • Examples given in the transcript:
    • Arm length of a heavyweight boxer: continuous (measurable length).
    • Height: continuous.
    • Age and time: continuous.
    • Number of errors (count): discrete (countable).
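
A rough sketch of the guiding principle above, with a caveat. The helper looks_discrete and all three data lists are hypothetical; the check only inspects how values were recorded, so it is a heuristic and not a definition.

```python
def looks_discrete(values):
    """Rough heuristic: True when every recorded value is a whole number."""
    return all(float(v).is_integer() for v in values)


family_sizes = [2, 4, 3, 5]            # counts of family members
arm_lengths_cm = [71.3, 68.9, 74.0]    # measured lengths
ages_in_whole_years = [19, 22, 20]     # a continuous variable, rounded when recorded

print(looks_discrete(family_sizes))          # True
print(looks_discrete(arm_lengths_cm))        # False
print(looks_discrete(ages_in_whole_years))   # True, although age itself is continuous
```

The last case shows why the classification depends on the variable and not on how it happens to be written down: age is continuous even when it is recorded in whole years.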

Levels of Measurement for Qualitative and Quantitative Variables

  • For qualitative variables (nominal and ordinal):
    • Nominal: lowest level; no natural ordering; categories are distinct with no ranking (e.g., gender, ZIP code, nationality).
    • Ordinal: qualitative with natural ordering (e.g., letter grades A, B, C, D or customer satisfaction scales like strongly disagree to strongly agree).
  • For quantitative variables (interval and ratio):
    • Interval: differences between values are meaningful; there is no true zero; zero does not indicate absence of quantity (e.g., Celsius temperature, years of birth in some contexts).
    • Ratio: differences and ratios are meaningful; there is a true zero (e.g., weight at birth, height, age, distance, price); this is the only level with a meaningful zero and meaningful ratios (e.g., 8 pounds is twice 4 pounds).
  • Summary of levels by type:
    • Qualitative: nominal, ordinal.
    • Quantitative: interval, ratio.
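
The four levels can be sketched as an ordered scale in which each level adds one kind of meaningful comparison. The Level enum, the NEW_OPERATIONS table, and both helper functions are hypothetical names introduced here; the cumulative ordering is the usual textbook convention and is assumed, not taken from the session. The Kelvin conversion is added only to show why ratios of Celsius values are not meaningful.

```python
from enum import IntEnum


class Level(IntEnum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# The comparison that first becomes meaningful at each level.
NEW_OPERATIONS = {
    Level.NOMINAL: "same / different",
    Level.ORDINAL: "rank order",
    Level.INTERVAL: "differences",
    Level.RATIO: "ratios",
}


def meaningful_operations(level):
    """A level keeps the comparisons of the levels below it and adds its own."""
    return [NEW_OPERATIONS[lower] for lower in Level if lower <= level]


def celsius_to_kelvin(c):
    """Kelvin has a true zero, so ratios of Kelvin values are meaningful."""
    return c + 273.15


print(meaningful_operations(Level.ORDINAL))   # ['same / different', 'rank order']

# Interval level: 20 °C is not "twice as hot" as 10 °C.
naive_ratio = 20 / 10
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)
print(naive_ratio)            # 2.0
print(f"{true_ratio:.3f}")    # 1.035

# Ratio level: 8 pounds really is twice 4 pounds.
print(8 / 4)                  # 2.0
```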

Quick Classification Exercises (from Transcript)

  • Example: Closing price
    • Variable type: quantitative (numerical).
    • Level of measurement: ratio (closing price has a true zero and ratios are meaningful).
  • Example: Marital status (single or married)
    • Variable type: qualitative (categorical).
    • Level of measurement: nominal (no natural ordering between single and married).
  • Example: Temperature
    • Variable type: quantitative.
    • Level of measurement: interval for Celsius or Fahrenheit (differences are meaningful, but zero does not represent an absence of temperature).
  • Example: Distance
    • Variable type: quantitative.
    • Level of measurement: ratio (has meaningful zero).
  • Example: Price (stock price, etc.)
    • Variable type: quantitative.
    • Level of measurement: ratio.
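
One way to check answers like these is to encode the rule that qualitative variables are nominal or ordinal and quantitative variables are interval or ratio. The dictionary names and is_consistent are hypothetical; the entries simply restate the exercises above.

```python
# Which levels of measurement go with which variable type (per the notes).
LEVELS_BY_TYPE = {
    "qualitative": {"nominal", "ordinal"},
    "quantitative": {"interval", "ratio"},
}

# Answer key for the exercises above: variable -> (type, level).
ANSWERS = {
    "closing price": ("quantitative", "ratio"),
    "marital status": ("qualitative", "nominal"),
    "temperature": ("quantitative", "interval"),
    "distance": ("quantitative", "ratio"),
    "price": ("quantitative", "ratio"),
}


def is_consistent(var_type, level):
    """True when the level is one of those allowed for the variable type."""
    return level in LEVELS_BY_TYPE[var_type]


for name, (var_type, level) in ANSWERS.items():
    status = "ok" if is_consistent(var_type, level) else "check this one"
    print(f"{name}: {var_type}, {level} ({status})")
```

A mismatched pair such as ("qualitative", "ratio") would be flagged, which is a quick sanity check on a classification.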

Worked Practice: Quick Classification from a Transcript Section

  • Quick exercise from Alex (population vs sample, parameter vs statistic) – general patterns:
    • If all items are described as a complete population (e.g., all trains, all subscribers), the measurement relates to a parameter (population value).
    • If a random sample is described (e.g., a random subset of trains or subscribers), the measurement relates to a statistic (sample value).
  • Sample classification examples in the text:
    • Population: all trains scheduled to depart last year; parameter example: average delay = 12.7 minutes; proportion delayed = 17.1%; total canceled = 15.
    • Sample: audited 50 records; sample average delay = 21.3 minutes; 32% delayed; 2 canceled.
    • Population: all subscribers = 10,985; parameters: 75% liked at least one blog; average comments per blog = 2.4; 4,135 liked all blogs.
    • Sample: 400 subscribers polled; 68% liked at least one; average comments per blog = 3.1; 149 liked all blogs (these values are sample statistics).
  • Key takeaway: distinguish population vs sample by whether data refer to all subjects or to a subset; distinguish parameter vs statistic by whether the value refers to the population or to a sample.

Practice Questions and Homework (from Transcript)

  • Homework discussion:
    • Two-question set described: identify population vs sample for described datasets, and identify whether the described values refer to a parameter or a statistic.
    • Example dataset: 400 subscribers polled; 68% liked at least one blog; average comments per blog = 3.1; 149 liked all blogs; population data available: total subscribers = 10,985; 75% liked at least one; average comments per blog = 2.4; 4,135 liked all blogs.
  • The instructor categorized items as:
    • Population vs sample: population corresponds to all, sample to the subset polled.
    • Parameter vs statistic: describe which pieces refer to population (parameter) vs sample (statistic).
  • The second dataset (operational data about trains) followed the same logic as described above.
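
For the subscriber dataset above, the "liked all blogs" counts can be turned into proportions so the sample statistic and the population parameter are directly comparable. This is a small worked calculation using only the figures quoted in the notes; the function name proportion is a hypothetical helper.

```python
def proportion(count, total):
    """Share of the group with the characteristic of interest."""
    return count / total


# Sample: 149 of the 400 polled subscribers liked all blogs (a statistic).
sample_prop = proportion(149, 400)

# Population: 4,135 of all 10,985 subscribers liked all blogs (a parameter).
population_prop = proportion(4135, 10985)

print(f"sample statistic:     {sample_prop:.2%}")       # 37.25%
print(f"population parameter: {population_prop:.2%}")   # 37.64%

# Stated percentages for "liked at least one blog": 68% (sample) vs 75% (population).
print(f"difference in percentage points: {68 - 75}")    # -7
```

Raw counts (149 vs 4,135) are not comparable because the groups differ in size; proportions are.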

Additional Notes from the Session

  • The instructor discussed the format of the course material and assignments:
    • Worksheets exist (Zero, One, and Week 1) and should be completed by the due dates.
    • Homework assignments include the two described problems plus related worksheets.
  • The session was recorded (Zoom) and may be published later.
  • Quick recap of key concepts:
    • Data, variable, population, sample, and individual distinctions.
    • Descriptive vs inferential statistics.
    • Probability and hypothesis testing as tools for inference.
    • Parameter vs statistic.
    • Qualitative vs quantitative variables; discrete vs continuous within quantitative.
    • Levels of measurement: nominal, ordinal (qualitative); interval, ratio (quantitative).
    • Examples help anchor the classifications and levels.
  • Formulas to remember:
    • Sample mean: \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
    • The concepts of population vs sample and the use of terms parameter vs statistic apply across examples.
  • Ethical and practical implications:
    • Inference from a sample to a population relies on sample representativeness; sampling costs are often justified by the information gained about the population.
    • Clear labeling of parameter vs statistic helps avoid confusion when interpreting results.
  • Real-world relevance:
    • The same framework applies to disease spread, energy usage, transportation efficiency, consumer surveys, and market data.