Notes on Population, Sampling, Variables, Data Types, Ethics, and Business Analytics
Population and Sampling
- Population (N): the entire group of interest that you want to learn about (e.g., all people in a class, the whole university population, all cars in a lot).
- Sample (n): a subset selected from the population according to certain criteria.
- Examples from the transcript:
- Population = everyone in this class; Sample = freshmen in this class.
- Population = all students at the university; Sample = freshmen.
- Population = all cars parked at the university; Sample = red cars only.
- Population = all traffic cameras in Salt Lake City; Sample = cameras in a single ZIP code.
- Why sample? Often you cannot measure or observe the entire population, so you study a representative subset to infer about the population.
- Visual intuition:
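- Picture a box representing the entire population (e.g., this class, or all employees at a company). A smaller box inside it (the sample) contains only the members that meet the selection criteria (e.g., the freshmen in the class, or the employees in accounting).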
- Visuals help distinguish populations from samples and show how samples can be much smaller or differently composed than the full population.
- Important quiz note: when asked to calculate the average for a population, include every member; when asked for a sample, calculate the average only over the sampled individuals.
- Data need not be physical objects; examples include abstract data like the ages of people in a city or the ages of residents in Sandy, Utah.
- Practical example: if you have data for 100 cameras (a sample of the city's cameras) and you compute the number of accidents per day for those 100 cameras, you may use the result to infer accident rates citywide, but only if you account for sampling bias and check that the 100 cameras are representative of the whole city.
- A numerical illustration: full population with a criterion (e.g., pull out all positive integers) yields a sample subset that is smaller than the full population.
- Key takeaway: the population is the entire group of interest; a sample is the subset that is actually selected and measured. Samples are used to infer properties of the population.
- Variables and data principles (linking to later sections): all data you collect could be a variable; not every data type is suitable for every analysis; some measurements yield meaningful statistics, others do not.
- Variables, data, and measurement:
- A variable is any characteristic, attribute, or quantity that can be measured or observed and can vary across individuals or over time.
- For data to be usable in statistics, it must be measurable/observable and capable of being re-measured if the population changes.
- Examples of variables include height, weight, temperature, number of students in a class, eye color, etc.
- Example connecting variables to analysis: hours of study vs. exam score. Collect data on study hours and the corresponding exam scores to examine the relationship between them (e.g., with correlation or regression).
- Descriptive vs. inferential use cases: descriptive statistics summarize the variables you collected; inferential statistics use those variables to analyze relationships, test hypotheses, and make predictions.
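The study-hours example above can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the function name `pearson_r` and the five data points are invented for the example, and the calculation uses the sample (n - 1) form of the covariance and standard deviations.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (all divide by n - 1).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Hypothetical data, made up for illustration only.
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 65, 70, 80]

r = pearson_r(study_hours, exam_scores)
print(f"r = {r:.3f}")
```

A value of r close to +1 would suggest that more study hours go with higher scores in this made-up sample; it says nothing by itself about cause, and a sample this small would not support strong conclusions about a population.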
Variables and Data Types
- Any characteristic you record can be treated as a variable. However, not all data can be analyzed with the same methods; the type of variable determines which analyses are appropriate.
- What is a variable?
- A characteristic, attribute, or quantity that can be measured or observed and can vary across individuals or entities.
- Data must be measurable/observed to be usable.
- Data can change over time; hence the label "variable".
- Quantitative vs Qualitative data:
- Quantitative (quantity): data that can be counted or measured numerically.
- Qualitative (categorical): data that describe qualities or categories, not inherently numerical.
- Two broad quantitative subtypes:
- Continuous data: can take on an infinite number of values within a range (in practice, limited only by the precision of the measuring instrument).
- Discrete data: countable values; distinct, separate values (often integers).
- Qualitative data split into:
- Nominal data: categories with no inherent order or ranking (e.g., gender, eye color, type of cuisine).
- Ordinal data: categories with a meaningful order, but intervals between categories are not necessarily uniform (e.g., freshman→senior; customer satisfaction ratings; hotel star ratings).
- Levels of measurement (for quantitative and qualitative data): nominal, ordinal, interval, ratio.
- Summary of levels:
- Nominal: categories with labels; no natural ordering; no meaningful arithmetic.
- Ordinal: ordered categories; differences between categories are not necessarily uniform; arithmetic is questionable.
- Interval: numeric values with meaningful differences; zero is arbitrary (does not indicate absence of amount).
- Ratio: numeric values with meaningful differences and a true zero; allows meaningful ratios and all arithmetic.
- Common examples by level:
- Nominal: gender, eye color, type of cuisine.
- Ordinal: freshman/sophomore/junior/senior; customer satisfaction ratings; hotel star ratings; race placements (first, second, third).
- Interval: temperature in Celsius or Fahrenheit; calendar years (the time between years is meaningful, but the zero point is arbitrary).
- Ratio: height, weight, time duration, number of items sold, number of courses taken.
- Key properties to remember:
- Interval data allow addition and subtraction with meaningful differences but do not have a true zero.
- Ratio data have a true zero and allow all arithmetic operations, including meaningful ratios.
- For discrete data, you can count exact values (e.g., number of students, items sold).
- For continuous data, values can be measured with arbitrary precision (e.g., height to the nearest millimeter, temperature to many decimal places).
- Practical guidance for analysts:
- When you look at a dataset, determine whether the data are discrete or continuous before choosing summaries and methods (e.g., mean, median, mode, histograms, t-tests).
- For qualitative data, decide whether the measure is nominal (no order) or ordinal (order exists) to choose suitable nonparametric or ordinal methods.
- Remember: categorical data sometimes have to be coded as numbers (proxies) before they can enter a calculation; always be cautious when interpreting such numbers.
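As a rough sketch of the guidance above, the following Python snippet returns only the summaries that are usually considered defensible at each level of measurement. The function `summarize`, the `CLASS_ORDER` list, and all data values are hypothetical and exist only for this illustration; the mapping from level to statistic is a common rule of thumb, not a strict rule.

```python
from statistics import mean, median, mode

# Assumed ordering for an ordinal variable (class standing).
CLASS_ORDER = ["freshman", "sophomore", "junior", "senior"]


def summarize(values, level, order=None):
    """Return the summaries that make sense for the given level of measurement."""
    if level == "nominal":
        # No order and no arithmetic: only the most frequent category.
        return {"mode": mode(values)}
    if level == "ordinal":
        if order is None:
            raise ValueError("ordinal data need an explicit category order")
        # Sort by rank and take the (lower) middle category as the median.
        ranked = sorted(values, key=order.index)
        return {"mode": mode(values), "median": ranked[(len(ranked) - 1) // 2]}
    if level in ("interval", "ratio"):
        # Differences are meaningful, so the mean is meaningful too.
        return {"mode": mode(values), "median": median(values), "mean": mean(values)}
    raise ValueError(f"unknown level: {level}")


# Hypothetical data for illustration.
eye_colors = ["brown", "blue", "brown", "green"]
standing = ["junior", "freshman", "senior", "freshman", "sophomore"]
items_sold = [3, 5, 5, 8, 9]

print(summarize(eye_colors, "nominal"))
print(summarize(standing, "ordinal", CLASS_ORDER))
print(summarize(items_sold, "ratio"))
```

Note that the ordinal branch never averages the rank codes: it only uses them to sort, which respects the order without assuming equal spacing between categories.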
Examples and Illustrative Scenarios
- Nominal examples: gender; eye color; type of cuisine; brand of computer.
- Ordinal examples: freshman/sophomore/junior/senior; customer satisfaction levels; hotel star ratings; racing place (1st, 2nd, 3rd).
- Interval examples: Celsius/Fahrenheit temperatures; calendar years (2025, 2015, etc.).
- Ratio examples: height, weight, duration, count data (courses taken, items sold).
- Clarifications:
- You can group or count nominal data by category, but you cannot meaningfully perform arithmetic on them unless you assign a numeric proxy for analysis (which requires caution).
- Interval data permit meaningful subtraction and averaging, but zero does not imply absence of the quantity.
- Ratio data permit meaningful ratios (e.g., twice as tall, half as long) because zero represents none of the quantity.
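The difference between interval and ratio data can be checked numerically. The sketch below, with made-up values and hypothetical helper names, converts the same two temperatures from Celsius to Fahrenheit and the same two heights from centimeters to inches. If a ratio changes when only the unit changes, the ratio was never meaningful.

```python
def celsius_to_fahrenheit(c):
    """Convert Celsius to Fahrenheit (both are interval scales)."""
    return c * 9 / 5 + 32


def cm_to_inches(cm):
    """Convert centimeters to inches (both are ratio scales)."""
    return cm / 2.54


# Interval data: two hypothetical temperatures.
t1_c, t2_c = 10.0, 20.0
t1_f, t2_f = celsius_to_fahrenheit(t1_c), celsius_to_fahrenheit(t2_c)

ratio_c = t2_c / t1_c  # looks like "twice as hot" in Celsius
ratio_f = t2_f / t1_f  # a different number for the same two temperatures

# Differences stay consistent: the Fahrenheit gap is the Celsius gap times 9/5.
diff_c = t2_c - t1_c
diff_f = t2_f - t1_f

# Ratio data: two hypothetical heights.
h1_cm, h2_cm = 90.0, 180.0
height_ratio_cm = h2_cm / h1_cm
height_ratio_in = cm_to_inches(h2_cm) / cm_to_inches(h1_cm)

print(f"temperature ratio: {ratio_c:.2f} (C) vs {ratio_f:.2f} (F)")
print(f"temperature difference: {diff_c:.1f} C = {diff_f:.1f} F")
print(f"height ratio: {height_ratio_cm:.2f} (cm) vs {height_ratio_in:.2f} (in)")
```

The temperature ratio depends on which scale is used because each scale places zero at an arbitrary point, while the height ratio is the same in any unit because zero height means no height at all.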
Ethics in Data Work
- Ethics define values and assumptions guiding data collection, analysis, and reporting.
- Core ethical concepts (not all quiz-covered at this level, but important):
- Informed consent: participants understand how their data will be used and agree to it.
- Confidentiality: protect identities and sensitive information; do not disclose data improperly.
- Integrity and accuracy: report data honestly; avoid fabrication or manipulation.
- Transparency: document methods; be clear about data sources and procedures.
- Reproducibility: others should be able to replicate results given the same data and methods.
- Fairness and equity: avoid sampling biases and selective reporting; represent the population accurately.
- Responsible interpretation: avoid cherry-picking results to tell a preferred story; acknowledge uncertainty and limitations.
- Communicating results: present findings clearly and ethically, with appropriate caveats.
- Relevance to business analytics:
- These values underlie the use of statistics in business decisions and research.
- The goal is to apply statistics to real-world business problems responsibly and transparently.
Business Analytics Framework (Descriptive, Predictive, Prescriptive)
- Descriptive analytics: describes past performance; summarizes data to understand what happened.
- Predictive analytics: uses data to forecast future outcomes with a measure of probability; never a guaranteed outcome.
- Probability-focused approach to predicting outcomes (eventualities) rather than certainties.
- Prescriptive analytics: recommends actions based on data analysis (e.g., pricing changes, marketing emphasis).
- Translates predictions and insights into concrete business decisions.
- The broader analytics ecosystem (high-level):
- Data collection and cleaning → Descriptive analytics → Inferential statistics and modeling → Predictive analytics → Prescriptive analytics → Decision-making and implementation.
- Practical note for exams: expect to focus on descriptive and inferential statistics, population vs sample, and data levels (nominal/ordinal/interval/ratio).
- The course emphasizes applying statistics with Excel for business contexts and distinguishing between descriptive, predictive, and prescriptive analytics.
Guided Lab, Quiz, and Real-World Application
- The instructor plans a guided lab and a 15-minute statistics quiz; if time is short, a follow-up video walkthrough of the guided lab steps will be provided.
- The slide deck and readings for the week contain the essential content for today’s quiz; focus areas include:
- Descriptive vs inferential statistics.
- Population and sample definitions and distinctions.
- Levels of measurement: nominal, ordinal, interval, ratio.
- Examples illustrating each data type and their implications for analysis.
- Real-world takeaway: these concepts form the foundation for business analytics and the use of tools like Excel to support data-driven decisions.
- Population mean: $\bar{X}_{\text{pop}} = \frac{1}{N} \sum_{i=1}^{N} X_i$
- Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Correlation (example):
$r = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_X \sigma_Y}$
- Levels of measurement recap:
- Nominal: categories with labels, no natural order.
- Ordinal: ordered categories, intervals not necessarily uniform.
- Interval: numeric with meaningful differences, zero is arbitrary.
- Ratio: numeric with a true zero; allows meaningful ratios.
- Takeaway notation:
- Population size: N
- Sample size: n
- Example data types: height, weight, duration (ratio); temperature (interval); satisfaction ratings (ordinal); gender (nominal).
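To tie the N and n notation to the quiz note about population and sample averages, here is a small Python sketch. The class roster is invented for illustration, and `arithmetic_mean` is a hypothetical helper name; the population is everyone in the class and the sample is the freshmen only.

```python
def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    if not values:
        raise ValueError("cannot average an empty list")
    return sum(values) / len(values)


# Hypothetical roster: (class standing, age) for every person in the class.
roster = [
    ("freshman", 18), ("freshman", 19), ("sophomore", 20), ("junior", 21),
    ("freshman", 18), ("senior", 23), ("sophomore", 19), ("senior", 22),
]

# Population: every member of the class.
population_ages = [age for _, age in roster]
# Sample: only the members meeting the criterion (freshmen).
sample_ages = [age for standing, age in roster if standing == "freshman"]

N = len(population_ages)  # population size
n = len(sample_ages)      # sample size

population_mean = arithmetic_mean(population_ages)  # averages all N members
sample_mean = arithmetic_mean(sample_ages)          # averages only the n sampled

print(f"N = {N}, population mean age = {population_mean:.2f}")
print(f"n = {n}, sample mean age = {sample_mean:.2f}")
```

Because this sample contains only freshmen, its mean age is lower than the population mean, which is a small example of why a sample chosen by a criterion may not be representative.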