LM

Stats unit 2 flashcards

Frequency Table: A table that organizes data into categories (bins) and shows how often each category appears. May include relative frequency and cumulative frequency.

Relative Frequency: The proportion of occurrences in each category, calculated as (class frequency / total data size). It is a property of the data set, not a statistic.

Cumulative Frequency: The sum of the frequencies of a given category and all previous categories. It can also be shown as a cumulative relative frequency.
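The three frequency columns can be built in one pass. The following is a minimal sketch, assuming each distinct value is treated as its own category; the function name `frequency_table` is made up for illustration, and the data is the eleven-value example set used later in these notes.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, frequency, relative freq, cumulative freq, cumulative relative freq)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0  # running sum of frequencies seen so far
    for category in sorted(counts):
        freq = counts[category]
        running += freq
        rows.append((category, freq, freq / total, running, running / total))
    return rows

# Example data set from the notes (assumption: every distinct value is a category)
data = [12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79]
for row in frequency_table(data):
    print(row)
```

The last row's cumulative frequency equals the total number of observations, and its cumulative relative frequency equals 1.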

Stemplot (Stem-and-Leaf Plot): A way to organize data by separating each value into a "stem" (leading digit(s)) and "leaf" (trailing digit). Helps visualize frequency distributions.

Modified Stemplot: A variation of a stemplot used for categorical data, where stems represent categories and leaves represent frequency markers or individual data points.

Bar Graph: A graph that represents categorical data with bars. The x-axis shows categories, and the y-axis shows frequency or relative frequency. Bars do not touch.

Pareto Graph: A bar graph where categories are ordered from most to least frequent. Helps visualize the most significant contributors to a dataset.

Histogram: A graph used for numerical data where bars represent frequency distributions. Unlike bar graphs, the bars touch, indicating a continuous data range.

Bins (Classes): Grouped ranges of numerical data used in histograms and frequency tables. Bins should have equal widths to avoid misleading graphs.

Range (of Data): Calculated as the difference between the largest and smallest values in a dataset. Helps determine bin width in histograms.

Rule of Thumb for Histogram Bins: Use between 3 and 20 bins, or approximately the square root of the sample size (√n), for an appropriate number of categories.

Class Boundaries: The precise limits of histogram bins, ensuring no gaps between categories. For continuous data, the upper limit of one class is the lower limit of the next.

Ordered Stemplot: A stemplot where leaves are arranged in ascending order within each stem. Makes patterns in data easier to identify.
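An ordered stemplot can be sketched in a few lines. This version assumes non-negative whole numbers, with the units digit as the leaf and everything to its left as the stem; the function name `ordered_stemplot` is hypothetical.

```python
from collections import defaultdict

def ordered_stemplot(values):
    """Return stemplot lines: stem = tens and above, leaf = units digit."""
    leaves = defaultdict(list)
    # Sorting first guarantees leaves are in ascending order within each stem.
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Keep empty stems so gaps in the data stay visible.
        leaf_text = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}".rstrip())
    return lines

data = [12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79]
print("\n".join(ordered_stemplot(data)))
```

Printing the empty stems (4, 5, 6, and 8 for this data) is a design choice: it keeps the vertical scale even, so the gap before 79 and 90 is visible.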

Cumulative Relative Frequency: The cumulative frequency divided by the total number of observations, showing the proportion of data at or below a given category.

Class Width: The range covered by each bin in a histogram. Ideally calculated as (Range / Number of Bins).
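The bin-count rule of thumb and the class-width formula combine naturally. The sketch below makes two assumptions that the notes do not state: √n is rounded to the nearest whole number, and the class width is rounded up so the bins cover the full range. The name `suggest_bins` is made up for illustration.

```python
import math

def suggest_bins(values):
    """Return (number of bins, range, class width) using the sqrt(n) rule of thumb."""
    n = len(values)
    data_range = max(values) - min(values)
    # Square-root rule, kept inside the 3-to-20 guideline from the notes.
    num_bins = min(20, max(3, round(math.sqrt(n))))
    # Assumption: round the width up so the bins span the whole range.
    class_width = math.ceil(data_range / num_bins)
    return num_bins, data_range, class_width

data = [12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79]
print(suggest_bins(data))  # (3, 78, 26)
```

With n = 11 the rule suggests 3 bins, and the range of 78 gives a class width of 26.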

DESCRIBING DATA: Measures of Center and Variability

I. Measures of Center (Measures of Central Tendency)

The Mean
  • The most useful and frequently used measure of center.

  • Advantage: The mean of a random sample is an unbiased estimate of the population mean and typically varies less from sample to sample than other estimates of center, such as the median.

  • Disadvantage: Highly influenced by outliers.

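  • It is the "balancing point" of a distribution, calculated as:
    $\bar{x} = \frac{\sum x_i}{n}$
    or using frequencies/relative frequencies (where $f_i$ is the frequency of each distinct value $x_i$):
    $\bar{x} = \frac{\sum x_i f_i}{n} = \sum x_i \cdot \frac{f_i}{n}$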

  • The term "expected value" refers to the long-term mean.

Example: Data: 12, 24, 35, 24, 90, 35, 35, 18, 34

  • Mean Calculation: $\bar{x} = \frac{12 + 24 + 35 + 24 + 90 + 35 + 35 + 18 + 34}{9} = \frac{307}{9} \approx 34.11$
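The direct and frequency-based ways of computing the mean give the same result. The sketch below checks this on the example data; the variable names are illustrative only.

```python
from collections import Counter

data = [12, 24, 35, 24, 90, 35, 35, 18, 34]
n = len(data)

# Direct definition: sum of all values divided by n.
mean_direct = sum(data) / n

# Frequency form: each distinct value times how often it occurs, divided by n.
freq = Counter(data)
mean_from_freq = sum(value * count for value, count in freq.items()) / n

# Relative-frequency form: each distinct value times its proportion.
mean_from_rel = sum(value * (count / n) for value, count in freq.items())

print(round(mean_direct, 2), round(mean_from_freq, 2), round(mean_from_rel, 2))
```

All three results agree up to floating-point rounding, at about 34.11.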

The Median
  • Advantage: Not significantly influenced by outliers.

  • Disadvantage: Not an unbiased estimate of the population median.

  • Found by ordering data and selecting the middle value.

    • If $n$ is odd: median = the $\frac{n+1}{2}$-th value.

    • If $n$ is even: median = average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th values.

Example: Data: 12, 24, 35, 24, 90, 35, 35, 18, 34

  • Ordered: 12, 18, 24, 24, 34, 35, 35, 35, 90

  • Median = 34
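The odd/even rule translates directly into code. This is a minimal sketch; note that Python lists are 0-based, so the middle value of an odd-length list sits at index `n // 2`.

```python
def median(values):
    """Median by ordering the data and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [12, 24, 35, 24, 90, 35, 35, 18, 34]
print(median(data))          # 34
print(median([1, 2, 3, 4]))  # 2.5
```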

The Mode
  • The most frequent value.

  • Advantage: Useful for categorical data.

  • Disadvantage: Not always a "typical" value and may not represent the dataset well.

Example: Data: 12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79

  • Modes: 24 and 35 (each appears three times, so this data set is bimodal)
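Counting the example data shows that 24 and 35 each appear three times, so a mode routine should report ties rather than a single value. The helper name `modes` below is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

data = [12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79]
print(modes(data))  # [24, 35]
```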

II. Shape of the Distribution

  • Symmetric: Mean ≈ Median.

  • Left Skewed (Negatively Skewed): Mean < Median.

  • Right Skewed (Positively Skewed): Mean > Median.

  • Uniform: No clear peaks, Mean ≈ Median.

  • Multimodal: Multiple peaks, hard to determine skewness.
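The mean-versus-median comparison can be turned into a rough check. This is only a heuristic sketch: the `tolerance` parameter (what counts as "approximately equal", expressed as a fraction of the range) is an assumption introduced here, not part of the notes, and a histogram is still the better way to judge shape.

```python
from statistics import mean, median

def skew_hint(values, tolerance=0.05):
    """Suggest a shape from the mean/median comparison (heuristic only)."""
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    # Treat mean and median as "about equal" if they differ by
    # no more than `tolerance` times the range (an assumed cutoff).
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "roughly symmetric"
    return "right skewed" if m > med else "left skewed"

# Hypothetical data sets chosen to illustrate each case
print(skew_hint([1, 2, 2, 3, 3, 3, 4, 20]))        # right skewed
print(skew_hint([1, 17, 18, 18, 19, 19, 19, 20]))  # left skewed
print(skew_hint([1, 2, 3, 4, 5]))                  # roughly symmetric
```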

III. Measures of Variability

Variance and Standard Deviation
  • Symbols:

    • Population variance: $\sigma^2$, Population standard deviation: $\sigma$

    • Sample variance: $s^2$, Sample standard deviation: $s$

  • Advantages: The sample variance (computed with the $n - 1$ divisor) is an unbiased estimate of the population variance, and both measures account for all values.

  • Disadvantages: Influenced by outliers, complex calculation.

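  • Variance formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample; for a population, $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

  • Standard deviation: $s = \sqrt{s^2}$ (sample), $\sigma = \sqrt{\sigma^2}$ (population)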

Example: Data: 1, 2, 3, 4

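  • Mean: $\bar{x} = \frac{1 + 2 + 3 + 4}{4} = 2.5$

  • Variance (sample): $s^2 = \frac{(1 - 2.5)^2 + (2 - 2.5)^2 + (3 - 2.5)^2 + (4 - 2.5)^2}{4 - 1} = \frac{5}{3} \approx 1.67$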
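The sample and population versions differ only in the divisor. The sketch below computes both for the example data, treating it first as a sample and then as a full population; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [1, 2, 3, 4]
s2 = sample_variance(data)          # 5/3, about 1.67
s = math.sqrt(s2)                   # about 1.29
sigma2 = population_variance(data)  # 1.25
print(round(s2, 2), round(s, 2), sigma2)
```

Python's standard library provides the same calculations as `statistics.variance` (sample) and `statistics.pvariance` (population).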

Quartiles and Interquartile Range (IQR)
  • Used in boxplots.

  • Quartiles:

    • $Q_1$ = 25th percentile (lower quartile)

    • $Q_2$ = 50th percentile (median)

    • $Q_3$ = 75th percentile (upper quartile)

  • Interquartile Range: $\text{IQR} = Q_3 - Q_1$

  • Example: Data: 12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79

    • Ordered: 12, 18, 24, 24, 24, 34, 35, 35, 35, 79, 90

    • , ,

    • IQR =
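One common classroom convention finds $Q_1$ and $Q_3$ as the medians of the lower and upper halves of the ordered data, leaving out the overall median when $n$ is odd. The sketch below assumes that convention; other conventions can give slightly different quartiles on other data sets. The function name `quartiles` is hypothetical, and it expects at least two values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 24 34 35
print(q3 - q1)     # IQR: 11
```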

Range
  • Formula: $\text{Range} = x_{\max} - x_{\min}$

  • Example: Data: 92 lbs. to 190 lbs.

    • Range = $190 - 92 = 98$ lbs.

  • Advantage: Easy to calculate.

  • Disadvantage: Sensitive to outliers.

IV. Boxplots (Box-and-Whisker Plots)

  • Constructed from a 5-number summary: Min, $Q_1$, Median ($Q_2$), $Q_3$, Max

  • Process:

    1. Order the data and determine the five-number summary.

    2. Draw a box from $Q_1$ to $Q_3$ with a line at the median ($Q_2$).

    3. Extend whiskers to the minimum and maximum values.

Example Data: 12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79

  • Five-Number Summary: Min = 12, $Q_1 = 24$, Median = 34, $Q_3 = 35$, Max = 90
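The five numbers a boxplot needs can also be obtained from Python's standard library. This sketch uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions; it matches the values above for this data but may differ slightly from a hand calculation on other data sets. The wrapper name `five_number_summary` is made up for illustration.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # n=4 splits the data into quarters, giving the three cut points Q1, Q2, Q3.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [12, 24, 35, 24, 90, 35, 35, 18, 34, 24, 79]
low, q1, med, q3, high = five_number_summary(data)
print(f"Box from {q1} to {q3}, line at {med}, whiskers to {low} and {high}")
```

The same function can be used to check a hand-computed summary for the practice data.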

Practice Data:

305, 658, 566, 505, 466, 344, 648, 400 (Depths of schools of fish on Lake Superior)

  • Calculate mean, median, mode, variance, standard deviation, quartiles, and construct a boxplot.


This document provides an overview of descriptive statistics focusing on measures of center, shape, and variability of distributions.