PSBE Chapter 1.1–1.4 Study Notes (Data, Graphs, Describing Distributions, Normal Distributions)
Data basics and getting started with distributions (PSBE Chapter 1.1)
- Data and variables
- Data describe objects, people, places, or situations.
- Goals when studying data:
- Collect and organize the data.
- Start investigating with a graph.
- Compute numerical summaries to describe the data.
- Look for overall patterns and deviations from the pattern.
- Use a statistical model as appropriate.
- Key terms
- Cases: the objects described by a set of data (e.g., customers, cities, patients, cars).
- Variable: a characteristic of a case (e.g., profit, duration of a service call, number of customers, gender).
- Different cases can have different values for the variables.
- Example: Real estate firm (1.1)
- Cases: Clients
- Variable: Referral source
- Values: Previous client, vendor, friend of realtor or staff, Internet advertisement, yard sign
- Types of variables
- Quantitative variable: takes numerical values for which arithmetic (adding, averaging) makes sense; examples: age, credit card balance, number of employees, time until served.
- Categorical variable: places a case into categories (ex: gender, brand, own a home yes/no).
- Example: Credit card spending study (1.1)
- Population: 21- to 25-year-old cardholders with $1000 limit; sample size: 100
- For each person: items recorded
- Items and expected variable types (credit card specific):
- Average balance over last year
- Type: Quantitative
- Possible values: $0.00 through $1000.00
- Ever late payments
- Type: Categorical
- Possible values: Yes, No
- Day of week most used
- Type: Categorical
- Possible values: Sunday, Monday, …, Saturday
- Age (in years)
- Type: Quantitative
- Possible values: integers 21, 22, 23, 24, 25
- Quick questions to understand a data set
- Who? What cases do the data describe? How many cases?
- What? How many variables? What is the exact definition and unit of each variable?
- Why? What is the purpose and what questions are being asked? Are the variables suitable?
Displaying distributions with graphs (PSBE Chapter 1.2)
- Objectives and overview
- Display distributions for categorical data: bar graphs, pie charts.
- Display distributions for quantitative data: histograms, stemplots, time plots.
- Interpret histograms and contrast with stemplots.
- Important terms
- Exploratory data analysis: examining and describing the main features of a data set.
- Distribution of a variable: values the variable takes and how often it takes them.
- Distribution of a categorical variable: lists categories and shows counts or percentages.
- Distribution of a quantitative variable: often shows ranges and frequencies.
- Displaying categorical data
- Purpose: summarize the data so characteristics of the distribution are clear.
- Process: list categories and give counts or percents per category.
- Methods: Bar graphs, Pie charts.
- Example: Marital status (categories: Married, Never married, Divorced, Widowed).
- Data summarized in table form (example counts in millions).
- Ordering categories is flexible (alphabetical, by value, by year, etc.).
- Examples
- Online research: Locations given by college students as their favorite source for online research.
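The process for categorical data (list the categories, then give a count or percent for each) can be sketched in a few lines of Python. The category names come from the real estate example in these notes; the individual observations and the function name `categorical_distribution` are made up for illustration.

```python
from collections import Counter

# Hypothetical observations; only the category names come from the notes.
referrals = [
    "Previous client", "Yard sign", "Internet advertisement",
    "Previous client", "Vendor", "Friend of realtor or staff",
    "Previous client", "Yard sign", "Internet advertisement",
    "Previous client",
]

def categorical_distribution(values):
    """Return {category: (count, percent)}, most frequent category first."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.most_common()}

dist = categorical_distribution(referrals)
for category, (count, percent) in dist.items():
    print(f"{category:<28}{count:>3}{percent:>7.1f}%")
```

The resulting counts or percents are what a bar graph or pie chart would display.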
- Ways to chart quantitative data
- Histograms and stemplots: single-variable summaries.
- Time plots: measurements over time; line emphasizes change.
- Histograms
- Construct by dividing the value range into equal-width classes and counting observations per class.
- Steps to create:
- Divide range into equal-width classes.
- Count observations in each class.
- Mark the x-axis with the class boundaries; scale the y-axis for the counts; draw a bar for each class.
- Number of classes: start with 5–10 classes and adjust; there is no single perfect choice.
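The histogram steps above can be sketched as follows. This is a minimal illustration, assuming the classes span exactly from the smallest to the largest observation and that the data are not all identical; the data values and the function name `histogram_counts` are hypothetical.

```python
def histogram_counts(data, num_classes):
    """Count observations in equal-width classes spanning min(data)..max(data)."""
    low, high = min(data), max(data)
    width = (high - low) / num_classes  # assumes high > low
    counts = [0] * num_classes
    for x in data:
        # The maximum would fall just past the last class; keep it in the last class.
        index = min(int((x - low) / width), num_classes - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_classes + 1)]
    return edges, counts

# Hypothetical data, chosen only to illustrate the steps.
data = [2, 3, 5, 7, 8, 8, 9, 12, 13, 14, 15, 18, 21, 22]
edges, counts = histogram_counts(data, 5)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f})  {'*' * count}")
```

Rerunning with a different `num_classes` shows how the apparent shape changes with the choice of classes.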
- Stemplots vs histograms
- Stemplots are quick, hand-done summaries that show actual data values; useful for rough calculations.
- They are less common in publications.
- Stemplot construction and considerations
- Group data by leading digits (stems) and leaves (final digits).
- Steps: split leading digits, write stems, place leaves in increasing order to the right.
- Advantages: shows actual data values; quick pattern checks.
- Limitations: not ideal for large data sets; digits can be rounded; stems can be split for many observations.
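The stemplot construction steps can be sketched as below. This is a simplified version, assuming non-negative whole numbers with the last digit as the leaf and no stem splitting; the data and the function name `stemplot` are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

# Hypothetical data with a gap in the 40s.
for line in stemplot([12, 15, 21, 23, 23, 28, 30, 34, 52]):
    print(line)
```

Because each leaf is an actual digit from the data, the original values can be read back from the plot.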
- Interpreting histograms
- Look for overall pattern: shape, center, spread.
- Common patterns: skewed right, skewed left, symmetric; many real distributions have more complex shapes.
- Outliers: deviations that lie far from the main pattern.
- Example note: in data on elderly residents, Alaska and Florida stand apart from the other states; large gaps in a histogram can indicate outliers.
- Time plots
- Time on x-axis; the variable of interest on y-axis.
- Look for trend (persistent rise/fall) and seasonal variation (regular intervals).
- Scales matter: axis scaling can affect interpretation of the graph.
- Practical tips
- A picture helps, but hard numbers matter; check the scales to avoid misinterpretation.
Describing distributions with numbers (PSBE Chapter 1.3)
- Measures of center
- Mean (arithmetic average):
- Formula: \bar{x} = \frac{1}{n} \left( \text{sum of all values} \right) = \frac{1}{n} \sum_{i=1}^{n} x_i
- Example: the mean is 16.292 days (from a sum of 391 over 24 cases).
- Median
- Definition: the midpoint where half the observations are below and half above.
- The median is resistant to skew and outliers; the mean is not.
- Comparing mean and median
- Symmetric distributions: mean and median are close.
- Skewed distributions: the mean is pulled toward the tail; the median remains closer to the center.
- Outliers: the mean can be heavily influenced by outliers, while the median largely resists them.
- Example visuals: symmetric vs right-skewed vs skewed with outliers (illustrative descriptions).
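A small numerical check of how the mean and median react to an outlier. The price list is hypothetical; the first line reproduces the 391 / 24 calculation quoted in these notes.

```python
from statistics import mean, median

# Mean from the notes: a sum of 391 over 24 cases.
print(round(391 / 24, 3))

# Hypothetical home prices in thousands of dollars.
prices = [210, 225, 240, 250, 265]
prices_with_outlier = prices + [1500]  # add one extreme value

print(mean(prices), median(prices))
print(mean(prices_with_outlier), median(prices_with_outlier))
```

Adding the single extreme price moves the mean by more than 200 while the median shifts only from 240 to 245, which is what "resistant" means in practice.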
- When to report mean vs median
- Realtor example: home prices produce mean and median values; discuss which is more attractive to buyers vs sellers; often report both.
- Middletown income example: mean for total tax base; median for typical living standards; suggests choosing based on purpose.
- Measuring spread: percentiles and quartiles
- Percentiles: arrange data, determine position corresponding to a percentage; there may not be an exact observation at the exact percentile.
- Quartiles: Q1 is the 25th percentile; Q3 is the 75th percentile; defined as medians of lower/upper halves (excluding the overall median).
- Five-number summary and boxplots
- Five-number summary: min, Q1, median (M), Q3, max.
- Boxplots visually display the five-number summary and can reveal symmetry/skewness.
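A minimal sketch of the five-number summary, using the quartile rule stated in these notes (medians of the lower and upper halves, leaving out the overall median when n is odd). The data and the function name `five_number_summary` are hypothetical. Statistical software often uses other quartile conventions, so its answers may differ slightly.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list with at least two values."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # observations below the median's position
    upper = xs[half + n % 2:]    # skips the middle observation when n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data.
print(five_number_summary([5, 7, 10, 14, 15, 18, 20, 22, 25]))
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3 with a line at the median.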
- Boxplots and comparisons
- Side-by-side boxplots compare distributions across groups.
- Outliers and the 1.5 IQR rule
- The interquartile range is IQR = Q3 - Q1, the spread of the middle half of the data.
- Rule: an observation is a suspected outlier if it lies above Q3 + 1.5 \times IQR or below Q1 - 1.5 \times IQR.
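The 1.5 × IQR rule can be sketched as follows, assuming quartiles are computed as medians of the lower and upper halves. The data and the helper names `quartiles` and `suspected_outliers` are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    return median(xs[: n // 2]), median(xs[n // 2 + n % 2:])

def suspected_outliers(data):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data with one unusually large value.
data = [5, 7, 10, 14, 15, 18, 20, 22, 25, 60]
print(quartiles(data), suspected_outliers(data))
```

Here Q1 = 10 and Q3 = 22, so IQR = 12 and the fences sit at -8 and 40; only 60 is flagged. A flagged value is a candidate for closer inspection, not automatically an error.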
- The standard deviation (spread around the mean)
- Definition: measures how far observations typically lie from the mean (the square root of the average squared deviation, with n - 1 in the denominator).
- Formula (sample): s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
- Steps: compute the variance s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, then take the square root.
- Example (hourly wages):
- Mean = \bar{x} = 16.33 (dollars per hour, in the example)
- Sum of squared deviations = 199.99
- Degrees of freedom: df = n-1 = 8
- Variance: s^2 = \frac{199.99}{8} \approx 25.00
- Standard deviation: s = \sqrt{25.00} = 5.00
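The steps for the sample standard deviation can be sketched as below. The individual wage values are not listed in these notes, so the data here are hypothetical; the last line only reuses the sum of squared deviations (199.99) and degrees of freedom (8) quoted in the example.

```python
import math
from statistics import stdev

def sample_sd(data):
    """Sample standard deviation: sqrt of squared deviations divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in data)
    variance = squared_deviations / (n - 1)
    return math.sqrt(variance)

# Hypothetical data; the result should agree with the standard library.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_sd(data)
print(s, stdev(data))

# Numbers from the notes' example: 199.99 over 8 degrees of freedom.
notes_s = math.sqrt(199.99 / 8)
print(round(notes_s, 2))
```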
- Properties and usage of standard deviation
- Use s to describe spread when the mean is the chosen center.
- s is not resistant to outliers and skew; it has the same units as the data.
- s = 0 only when all observations are identical (no spread).
- Choosing measures of center and spread
- If the data are fairly symmetric with no outliers, use mean and standard deviation.
- If skew or outliers are present, report the five-number summary (min, Q1, median, Q3, max) and consider a boxplot.
- In practice, report both when appropriate to let the reader decide.
- Example contrasts and guidance
- Real estate or income data examples illustrate how the mean can be sensitive to extreme values while the median remains robust.
- Practical note on reporting
- Always consider the distribution shape and the presence of outliers when choosing summary statistics.
The Normal distributions (PSBE Chapter 1.4)
- Density curves and key properties
- A density curve is a model for a distribution with total area under the curve equal to 1.
- The area under the curve over a range gives the proportion of observations in that range.
- The mean and median of a density curve: the mean is the balance point; the median is the equal-areas point.
- For symmetric density curves, mean = median.
- For skewed curves, the mean is pulled toward the long tail.
- Normal distributions
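- Family: X \sim N(\mu, \sigma), where \mu is the mean and \sigma is the standard deviation; the density curve is bell-shaped and symmetric about \mu.
- Density: f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, with constants e \approx 2.71828…, \pi \approx 3.14159…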
- The 68-95-99.7 rule: approximately
- P(|X-\mu| \le \sigma) \approx 0.68
- P(|X-\mu| \le 2\sigma) \approx 0.95
- P(|X-\mu| \le 3\sigma) \approx 0.997
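The 68-95-99.7 rule can be checked numerically with the standard library's `statistics.NormalDist`. The helper name `proportion_within` is made up for this sketch.

```python
from statistics import NormalDist

def proportion_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard Normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {proportion_within(k):.4f}")
```

The exact proportions are about 0.6827, 0.9545, and 0.9973, so the rule's round numbers are approximations.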
- The standard normal distribution
- Standardization: convert any normal to the standard normal Z \sim N(0,1) via
- Z = \frac{X - \mu}{\sigma}
- This allows comparison across different normal distributions.
- Example: women’s heights
- Heights follow X \sim N(\mu=64.5, \sigma=2.5) inches.
- Probability that a woman is shorter than 67 inches: P(X < 67).
- Compute z for 67: z = \frac{67 - 64.5}{2.5} = 1.
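- By the 68-95-99.7 rule, about 68% of heights lie within one standard deviation of the mean, so about 16% lie above z = 1 and P(X < 67) ≈ 0.84; Table A gives the more precise value 0.8413.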
- Using Table A (the standard normal table)
- The area to the left of a z-value gives the cumulative probability up to that z.
- For z = 1.00, area to the left ≈ 0.8413.
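The women's heights calculation can be reproduced in software in place of Table A. This sketch uses `statistics.NormalDist` from the Python standard library and the N(64.5, 2.5) model from the example above; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # model from the notes, in inches

# Route 1: standardize, then use the standard Normal (what Table A tabulates).
z = (67 - heights.mean) / heights.stdev
p_via_z = NormalDist().cdf(z)

# Route 2: ask for the area to the left of 67 directly.
p_direct = heights.cdf(67)

print(z, round(p_via_z, 4), round(p_direct, 4))
```

Both routes give the same area to the left, which is why standardizing loses nothing.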
- Inverse normal calculations (finding x given a proportion)
- Process: locate the desired proportion in Table A (the area to the left); read the corresponding z-value; then unstandardize:
- Formula: x = \mu + z_p \sigma, where z_p is the z-value with area p to the left.
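A sketch of the inverse calculation, using `NormalDist.inv_cdf` in place of a backward Table A lookup. The N(505, 110) figures are the SAT Verbal model quoted in these notes; the function name `unstandardize` is hypothetical.

```python
from statistics import NormalDist

def unstandardize(p, mu, sigma):
    """Return x with area p to its left under N(mu, sigma): x = mu + z_p * sigma."""
    z_p = NormalDist().inv_cdf(p)  # z-value with area p to the left
    return mu + z_p * sigma

# Top 10% means 90% of scores lie to the left.
z_90 = NormalDist().inv_cdf(0.90)
score = unstandardize(0.90, mu=505, sigma=110)
print(round(z_90, 2), round(score))
```

Note that "top 10%" has to be converted to a left-hand area of 0.90 before looking up z.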
- Tips for Table A and common calculations
- Because the normal distribution is symmetric, to find the area to the right of a z-value, either use 1 minus the left-area or use symmetry.
- Area between two z-values z_1 < z_2: compute the left area for each and subtract the smaller from the larger: \text{area}(z_1 \text{ to } z_2) = \text{area left of } z_2 - \text{area left of } z_1
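The two Table A tips can be sketched as small helpers. The area between two z-values is the left area of the larger z minus the left area of the smaller z. The function names `area_right` and `area_between` are made up for illustration.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Normal: mean 0, standard deviation 1

def area_right(z):
    """Area to the right of z: 1 minus the area to the left."""
    return 1 - STD.cdf(z)

def area_between(z1, z2):
    """Area between two z-values, whichever order they are given in."""
    low, high = sorted((z1, z2))
    return STD.cdf(high) - STD.cdf(low)

print(round(area_right(1.0), 4))          # same as the area left of -1.0, by symmetry
print(round(area_between(-1.0, 1.0), 4))
```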
- Real-world examples and applications
- NCAA SAT qualifiers example (top-level): require a score threshold; given a mean and sd for SAT, compute the proportion above a threshold by converting to z and using Table A or standard normal calculations.
- Example: NCAA threshold requires a combined SAT of at least 820 for a partial qualifier; using normal approximation you can compute the proportion in that range.
- SAT Verbal example: distribution approximates N(505, 110). To be in the top 10%, z = 1.28; solving for x gives x ≈ 646.
- Practical notes on standardization and comparing distributions
- Standardizing allows comparing distributions with different centers and spreads on a common scale.
- TI-84 and Table A usage notes (summary)
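- For normal calculations, you can use a TI-84 or other calculators/software in place of Table A: on the TI-84, normalcdf(lower, upper, μ, σ) gives the area between two values and invNorm(area, μ, σ) gives the value with that area to its left.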
- Summary one-liners
- The normal distribution is a cornerstone model because sums and averages of many small independent effects tend to be approximately normal (the central limit theorem).
- Standardization (z-scores) enables cross-distribution comparisons and facilitates inverse-probability calculations.