PSBE Chapter 1.1–1.4 Study Notes (Data, Graphs, Describing Distributions, Normal Distributions)

Data basics and getting started with distributions (PSBE Chapter 1.1)

  • Data and variables
    • Data describe objects, people, places, or situations.
    • Goals when studying data:
      • Collect and organize the data.
      • Start investigating with a graph.
      • Compute numerical summaries to describe the data.
      • Look for overall patterns and deviations from the pattern.
      • Use a statistical model as appropriate.
  • Key terms
    • Cases: the objects described by a set of data (e.g., customers, cities, patients, cars).
    • Variable: a characteristic of a case (e.g., profit, duration of a service call, number of customers, gender).
    • Different cases can have different values for the variables.
  • Example: Real estate firm (1.1)
    • Cases: Clients
    • Variable: Referral source
    • Values: Previous client, vendor, friend of realtor or staff, Internet advertisement, yard sign
  • Types of variables
    • Quantitative variable: takes numerical values with arithmetic; examples: age, credit card balance, number of employees, time until served.
    • Categorical variable: places a case into one of several categories (e.g., gender, brand, owning a home: yes/no).
  • Example: Credit card spending study (1.1)
    • Population: 21- to 25-year-old cardholders with $1000 limit; sample size: 100
    • Items recorded for each person, with expected variable types:
    • Average balance over last year
      • Type: Quantitative
      • Possible values: $0.00 through $1000.00
    • Ever late payments
      • Type: Categorical
      • Possible values: Yes, No
    • Day of week most used
      • Type: Categorical
      • Possible values: Sunday, Monday, …, Saturday
    • Age (in years)
      • Type: Quantitative
      • Possible values: integers 21, 22, 23, 24, 25
  • Quick questions to understand a data set
    • Who? What cases do the data describe? How many cases?
    • What? How many variables? What is the exact definition and unit of each variable?
    • Why? What is the purpose and what questions are being asked? Are the variables suitable?

Displaying distributions with graphs (PSBE Chapter 1.2)

  • Objectives and overview
    • Display distributions for categorical data: bar graphs, pie charts.
    • Display distributions for quantitative data: histograms, stemplots, time plots.
    • Interpret histograms and contrast with stemplots.
  • Important terms
    • Exploratory data analysis: examining data to describe their main features.
    • Distribution of a variable: values the variable takes and how often it takes them.
    • Distribution of a categorical variable: lists categories and shows counts or percentages.
    • Distribution of a quantitative variable: often shows ranges and frequencies.
  • Displaying categorical data
    • Purpose: summarize the data so characteristics of the distribution are clear.
    • Process: list categories and give counts or percents per category.
    • Methods: Bar graphs, Pie charts.
    • Example: Marital status (categories: Married, Never married, Divorced, Widowed).
    • Data summarized in table form (example counts in millions).
    • Ordering categories is flexible (alphabetical, by value, by year, etc.).
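The count-and-percent summary described above can be sketched in a few lines of Python. The referral-source responses below are hypothetical, echoing the real estate example; only the categories come from the notes.

```python
from collections import Counter

# Hypothetical referral-source responses (a categorical variable);
# the category names echo Example 1.1, the counts are made up.
responses = ["Previous client", "Internet ad", "Yard sign",
             "Previous client", "Friend", "Internet ad", "Previous client"]

counts = Counter(responses)
n = len(responses)
for category, count in counts.most_common():
    # Each category with its count and percent of the total.
    print(f"{category:16s} {count:2d}  {100 * count / n:5.1f}%")
```

This table of counts and percents is exactly what a bar graph or pie chart displays.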
  • Examples
    • Online research: Locations given by college students as their favorite source for online research.
  • Ways to chart quantitative data
    • Histograms and stemplots: single-variable summaries.
    • Time plots: measurements over time; line emphasizes change.
  • Histograms
    • Construct by dividing the value range into equal-width classes and counting observations per class.
    • Steps to create:
      • Divide the range into equal-width classes.
      • Count the observations in each class.
      • Mark the class boundaries on the x-axis, scale the y-axis for counts, and draw a bar for each class.
    • Number of classes: start with 5–10 and adjust; there is no single perfect choice.
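The binning steps above can be sketched without any plotting library. Both `histogram_counts` and the sample data are hypothetical illustrations, not part of the text:

```python
from collections import Counter

def histogram_counts(data, num_classes=5):
    """Divide the value range into equal-width classes and
    count the observations in each class."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes
    counts = Counter()
    for x in data:
        # The maximum value is placed in the last class.
        k = min(int((x - lo) / width), num_classes - 1)
        counts[k] += 1
    return [((lo + k * width, lo + (k + 1) * width), counts[k])
            for k in range(num_classes)]

sample = [2, 3, 5, 7, 8, 11, 12, 13, 14, 20]   # hypothetical data
for (left, right), count in histogram_counts(sample):
    print(f"[{left:5.1f}, {right:5.1f}): {count}")
```

Each printed class interval with its count corresponds to one bar of the histogram.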
  • Stemplots vs histograms
    • Stemplots are quick, hand-done summaries that show actual data values; useful for rough calculations.
    • They are less common in publications.
  • Stemplot construction and considerations
    • Group data by leading digits (stems) and leaves (final digits).
    • Steps: split leading digits, write stems, place leaves in increasing order to the right.
    • Advantages: shows actual data values; quick pattern checks.
    • Limitations: not ideal for large data sets; digits can be rounded; stems can be split for many observations.
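A minimal text stemplot following the steps above, assuming two-digit nonnegative integer data (stem = tens digit, leaf = units digit); the `stemplot` helper and its data are hypothetical:

```python
def stemplot(data):
    """Text stemplot: group values by tens digit (stem), list the
    units digits (leaves) in increasing order to the right."""
    stems = {}
    for x in sorted(data):                 # sorting puts leaves in order
        stem, leaf = divmod(x, 10)
        stems.setdefault(stem, []).append(leaf)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:2d} | {leaves}")
    return "\n".join(lines)

print(stemplot([9, 12, 15, 15, 21, 23, 23, 24, 31, 45]))
```

Empty stems are still printed, so gaps in the data remain visible, which is how a stemplot reveals outliers.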
  • Interpreting histograms
    • Look for overall pattern: shape, center, spread.
    • Common patterns: skewed right, skewed left, symmetric; many shapes can be complex.
    • Outliers: deviations that lie far from the main pattern.
    • Example note: in state-level data, Alaska and Florida may stand out for unusually low and high percentages of elderly residents; large gaps in a histogram can indicate outliers.
  • Time plots
    • Time on x-axis; the variable of interest on y-axis.
    • Look for trend (persistent rise/fall) and seasonal variation (regular intervals).
    • Scales matter: axis scaling can affect interpretation of the graph.
  • Practical tips
    • A picture helps, but hard numbers matter; check the scales to avoid misinterpretation.

Describing distributions with numbers (PSBE Chapter 1.3)

  • Measures of center
    • Mean (arithmetic average):
    • Formula: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
    • Example: a total of 391 over 24 cases gives a mean of 391/24 ≈ 16.292 days.
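The formula is a one-liner in code. The short list of durations below is hypothetical; the second calculation reproduces the 391-over-24 example:

```python
from statistics import mean

# Hypothetical durations (days); any list of quantitative values works.
durations = [12, 15, 20, 18, 16]
xbar = sum(durations) / len(durations)   # bar{x} = (1/n) * (sum of values)
assert xbar == mean(durations)           # stdlib helper gives the same result
print(xbar)

# The chapter's example: a total of 391 over 24 cases.
print(round(391 / 24, 3))                # 16.292
```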
  • Median
    • Definition: the midpoint where half the observations are below and half above.
    • The median is resistant to skew and outliers; the mean is not.
  • Comparing mean and median
    • Symmetric distributions: mean and median are close.
    • Skewed distributions: the mean is pulled toward the tail; the median remains closer to the center.
    • Outliers: the mean can be heavily influenced by outliers, while the median largely resists them.
    • Example visuals: symmetric vs right-skewed vs skewed with outliers (illustrative descriptions).
  • When to report mean vs median
    • Realtor example: home prices produce mean and median values; discuss which is more attractive to buyers vs sellers; often report both.
    • Middletown income example: mean for total tax base; median for typical living standards; suggests choosing based on purpose.
  • Measuring spread: percentiles and quartiles
    • Percentiles: arrange data, determine position corresponding to a percentage; there may not be an exact observation at the exact percentile.
    • Quartiles: Q1 is the 25th percentile; Q3 is the 75th percentile; defined as medians of lower/upper halves (excluding the overall median).
  • Five-number summary and boxplots
    • Five-number summary: min, Q1, median (M), Q3, max.
    • Boxplots visually display the five-number summary and can reveal symmetry/skewness.
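A sketch of the five-number summary using the medians-of-halves definition of quartiles given above (the overall median is excluded from both halves when n is odd); `five_number_summary` is a hypothetical helper name:

```python
def median(xs):
    """Midpoint: middle value for odd n, average of the two middle values for even n."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(data):
    """Return (min, Q1, M, Q3, max), with Q1/Q3 as medians of the
    lower/upper halves, excluding the overall median when n is odd."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower, upper = xs[:half], xs[half + (n % 2):]
    return (min(xs), median(lower), median(xs), median(upper), max(xs))

print(five_number_summary([1, 3, 4, 6, 8, 9, 12]))   # (1, 3, 6, 9, 12)
```

These five numbers are exactly the values a boxplot draws: whisker ends, box edges, and the center line.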
  • Boxplots and comparisons
    • Side-by-side boxplots compare distributions across groups.
  • Outliers and the 1.5 IQR rule
    • Suspected outliers can be flagged if they fall more than 1.5 times the IQR above Q3 or below Q1:
    • Rule: an observation is a suspected outlier if it lies beyond Q3 + 1.5 \times IQR or below Q1 - 1.5 \times IQR, where IQR = Q3 - Q1.
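The 1.5 × IQR rule translates directly into code. This sketch reuses the medians-of-halves quartiles and flags values beyond the fences; the function name and data are made up for illustration:

```python
def _median(xs):
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def suspected_outliers(data):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q1 = _median(xs[:half])              # median of the lower half
    q3 = _median(xs[half + (n % 2):])    # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(suspected_outliers([5, 7, 8, 9, 10, 11, 12, 40]))   # [40]
```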
  • The standard deviation (spread around the mean)
    • Definition: measures the average distance of observations from the mean.
    • Formula (sample standard deviation): s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
    • Steps: compute the variance s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, then take the square root.
    • Example (hourly wages):
    • Mean = \bar{x} = 16.33 (dollars per hour, in the example)
    • Sum of squared deviations = 199.99
    • Degrees of freedom: df = n-1 = 8
    • Variance: s^2 = \frac{199.99}{8} = 25.00
    • Standard deviation: s = \sqrt{25.00} = 5.00
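The variance-then-square-root steps can be checked against Python's `statistics.stdev`, which also divides by n − 1. The wage list below is hypothetical, chosen with n = 9 so that df = 8 as in the example, though the numbers themselves differ from the chapter's:

```python
from math import sqrt
from statistics import stdev

# Hypothetical hourly wages (dollars per hour); n = 9, so df = n - 1 = 8.
wages = [10, 12, 14, 15, 16, 18, 19, 21, 22]
n = len(wages)
xbar = sum(wages) / n
ss = sum((x - xbar) ** 2 for x in wages)   # sum of squared deviations
variance = ss / (n - 1)                    # s^2 with df = n - 1
s = sqrt(variance)
# statistics.stdev uses the same n - 1 divisor.
assert abs(s - stdev(wages)) < 1e-9
print(round(variance, 2), round(s, 2))
```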
  • Properties and usage of standard deviation
    • Use s to describe spread when the mean is the chosen center.
    • s is not resistant to outliers and skew; it has the same units as the data.
    • s = 0 only when all observations are identical (no spread).
  • Choosing measures of center and spread
    • If the data are fairly symmetric with no outliers, use mean and standard deviation.
    • If skew or outliers are present, report the five-number summary (min, Q1, median, Q3, max) and consider a boxplot.
    • In practice, report both when appropriate to let the reader decide.
  • Example contrasts and guidance
    • Real estate or income data examples illustrate how the mean can be sensitive to extreme values while the median remains robust.
  • Practical note on reporting
    • Always consider the distribution shape and the presence of outliers when choosing summary statistics.

The Normal distributions (PSBE Chapter 1.4)

  • Density curves and key properties
    • A density curve is a model for a distribution with total area under the curve equal to 1.
    • The area under the curve over a range gives the proportion of observations in that range.
    • The mean and median of a density curve: the mean is the balance point; the median is the equal-areas point.
    • For symmetric density curves, mean = median.
    • For skewed curves, the mean is pulled toward the long tail.
  • Normal distributions
    • Family: X \sim N(\mu, \sigma) with density shaped like a bell.
    • Common constants: e \approx 2.71828…, \pi \approx 3.14159…
    • The 68-95-99.7 rule: approximately
    • P(|X-\mu| \le \sigma) \approx 0.68
    • P(|X-\mu| \le 2\sigma) \approx 0.95
    • P(|X-\mu| \le 3\sigma) \approx 0.997
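The 68-95-99.7 rule can be verified numerically from the standard normal CDF, which is expressible through the error function in the standard library:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    # P(|X - mu| <= k * sigma) = Phi(k) - Phi(-k)
    print(k, round(phi(k) - phi(-k), 4))   # 0.6827, 0.9545, 0.9973
```

The exact values 0.6827, 0.9545, 0.9973 show why "68-95-99.7" is a rounded rule of thumb.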
  • The standard normal distribution
    • Standardization: convert any normal to the standard normal Z \sim N(0,1) via
    • Z = \frac{X - \mu}{\sigma}
    • This allows comparison across different normal distributions.
  • Example: women’s heights
    • Heights follow X \sim N(\mu=64.5, \sigma=2.5) inches.
    • Probability that a woman is shorter than 67 inches: P(X < 67).
    • Compute z for 67: z = \frac{67 - 64.5}{2.5} = 1.
    • By the 68-95-99.7 rule or Table A, P(X < 67) ≈ 0.84 (more precisely, 0.8413).
  • Using Table A (the standard normal table)
    • The area to the left of a z-value gives the cumulative probability up to that z.
    • For z = 1.00, area to the left ≈ 0.8413.
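Python's `statistics.NormalDist` (3.8+) reproduces Table A's left-tail areas; here it checks the women's-heights example:

```python
from statistics import NormalDist

# Women's heights from the example: X ~ N(64.5, 2.5) inches.
X = NormalDist(mu=64.5, sigma=2.5)
z = (67 - 64.5) / 2.5               # standardize: z = (x - mu) / sigma
print(z)                            # 1.0
print(round(X.cdf(67), 4))          # 0.8413 — the Table A entry for z = 1.00
```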
  • Inverse normal calculations (finding x given a proportion)
    • Process: locate the desired proportion in Table A (the area to the left); read the corresponding z-value; then unstandardize:
    • Formula: x = \mu + z_p \sigma, where z_p is the z-value with area p to its left.
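The look-up-then-unstandardize process corresponds to an inverse CDF, and `NormalDist.inv_cdf` performs both steps at once; this sketch uses the SAT Verbal numbers that appear later in these notes:

```python
from statistics import NormalDist

# SAT Verbal scores approximately N(505, 110); find the top-10% cutoff.
sat = NormalDist(mu=505, sigma=110)
x = sat.inv_cdf(0.90)               # x = mu + z_p * sigma with p = 0.90
z_p = (x - 505) / 110               # recover the z-value Table A would give
print(round(z_p, 2), round(x))      # 1.28 646
```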
  • Tips for Table A and common calculations
    • Because the normal distribution is symmetric, to find the area to the right of a z-value, either use 1 minus the left-area or use symmetry.
    • Area between two z-values (with z_1 < z_2): subtract the left areas: \text{area}(z_1 \text{ to } z_2) = \text{area left of } z_2 - \text{area left of } z_1
  • Real-world examples and applications
    • NCAA SAT qualifiers example (top-level): require a score threshold; given a mean and sd for SAT, compute the proportion above a threshold by converting to z and using Table A or standard normal calculations.
    • Example: NCAA threshold requires a combined SAT of at least 820 for a partial qualifier; using normal approximation you can compute the proportion in that range.
    • SAT Verbal example: distribution approximates N(505, 110). To be in the top 10%, z = 1.28; solving for x gives x ≈ 646.
  • Practical notes on standardization and comparing distributions
    • Standardizing allows comparing distributions with different centers and spreads on a common scale.
  • TI-84 and Table A usage notes (summary)
    • For normal calculations, you can use TI-84 or other calculators/software to find probabilities and z-values.
  • Summary one-liners
    • The normal distribution is a cornerstone model for many natural phenomena because, by the central limit theorem, sums and averages of many independent effects tend toward normality.
    • Standardization (z-scores) enables cross-distribution comparisons and facilitates inverse-probability calculations.