Chapter 1: Exploring Data

Page 1: Exploring Data 1.1 and 1.2: Categorical Variables and Displaying Distributions with Graphs

Individuals and Variables

  • Individuals: Objects described by a set of data; can be people, animals, or things.

  • Variables: Characteristics of individuals that can take different values for different individuals.

Categorical and Quantitative Variables

  • Categorical Variable: Places individuals into groups or categories.

  • Quantitative Variable: Takes numerical values where arithmetic operations like adding and averaging make sense.

Distribution

  • The distribution of a variable indicates the values the variable takes and their frequencies.
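
As a minimal sketch of this definition, the snippet below tabulates a distribution for a categorical variable: each value it takes, how often, and what proportion of individuals fall in that category. The `colors` data are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical categorical data: favorite color reported by 8 individuals.
colors = ["red", "blue", "blue", "green", "red", "blue", "green", "blue"]

# The distribution: each value the variable takes and how often it occurs.
counts = Counter(colors)
n = len(colors)
relative = {value: count / n for value, count in counts.items()}

for value, count in counts.most_common():
    print(f"{value}: count={count}, proportion={relative[value]:.3f}")
```

The proportions always sum to 1, which is a quick check that no individual was dropped or counted twice.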

Describing Overall Pattern of a Distribution – SOCS

  1. Shape: Assess the graph for symmetry or skewness.

  2. Outliers: Determine if there are any unusual values.

  3. Center: Estimate a typical value of the data, such as the mean or median.

  4. Spread: Describe the variability, for example from the lowest to the highest value in the dataset.

Outliers

  • An outlier is an observation that falls outside the overall pattern of the distribution.

Describing the SHAPE of a Distribution

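  • Symmetric Distribution: The left and right sides are approximately mirror images; Mean ≈ Median.

  • Skewed Left: The left tail (smaller values) is longer; typically Mean < Median.

  • Skewed Right: The right tail (larger values) is longer; typically Mean > Median.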
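
The mean/median comparisons above can be checked numerically. The sketch below uses three small hypothetical data sets, chosen only to show how a long tail pulls the mean away from the median.

```python
from statistics import mean, median

# Hypothetical data sets, chosen only to illustrate each shape.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 3, 4, 20]   # long tail toward large values
skewed_left = [-14, 2, 3, 4, 5]   # long tail toward small values

for name, data in [
    ("symmetric", symmetric),
    ("skewed right", skewed_right),
    ("skewed left", skewed_left),
]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

All three sets share the median 3, but the single extreme value shifts the mean to 6 in the right-skewed set and to 0 in the left-skewed set.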

Time Plot

  • A time plot displays each observation against the time of measurement.

  • Mark time scale on the horizontal axis and variable of interest on the vertical axis; connecting points helps visualize patterns over time.

Page 2: Describing Distributions with Numbers 1.3

The Mean

  • To find the mean (x̄) of observations:

    • Add their values and divide by the number of observations: [ \bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} ]
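
A short sketch of the mean formula, applied to the five observations used in the standard deviation example later in these notes (6, 3, 8, 5, 2):

```python
# Observations from the standard deviation example in these notes.
data = [6, 3, 8, 5, 2]

n = len(data)          # number of observations
total = sum(data)      # x_1 + x_2 + ... + x_n
x_bar = total / n      # the mean

print(total, n, x_bar)  # 24 5 4.8
```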

The Median (M)

  • The median (M) is the middle value of a distribution once the observations are arranged in order from smallest to largest:

    • Odd n: Center observation is at position (n + 1) / 2.

    • Even n: Median is the average of the two center observations (positions n/2 and n/2 + 1).
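
The position rules above can be sketched as a small function. The name `find_median` is illustrative; note that the 1-based positions in the rule become 0-based list indices in Python.

```python
def find_median(values):
    """Median via the position rule: sort, then pick the center."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is index n // 2 when counting from 0.
        return ordered[n // 2]
    # Even n: average positions n/2 and n/2 + 1 (indices n//2 - 1 and n//2).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(find_median([6, 3, 8, 5, 2]))  # 5   (ordered: 2 3 5 6 8)
print(find_median([6, 3, 8, 5]))     # 5.5 (ordered: 3 5 6 8)
```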

The Five-Number Summary

  • Five-number summary: Minimum, Q1, Median, Q3, Maximum, in order from smallest to largest:

    • Summary Format: Minimum – Q1 – M – Q3 – Maximum

The Quartiles (Q1 and Q3)

  • To find quartiles:

    • Arrange observations in order.

    • Q1: Median of the observations that lie below the overall median in the ordered list.

    • Q3: Median of the observations that lie above the overall median in the ordered list.
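
The quartile steps and the five-number summary can be sketched together as follows. This follows the rule stated in these notes (for odd n, the overall median is left out of both halves); statistical software may use other quartile conventions and report slightly different values. The function names and the sample data are illustrative.

```python
def median_of_sorted(ordered):
    """Median of an already-sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least 2 values."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]  # observations below the median position
    # For odd n, skip the median itself; for even n, take the upper half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (
        ordered[0],
        median_of_sorted(lower),
        median_of_sorted(ordered),
        median_of_sorted(upper),
        ordered[-1],
    )


# Hypothetical data; ordered they are 2 3 5 6 8 10 12.
print(five_number_summary([12, 2, 8, 3, 10, 5, 6]))  # (2, 3, 6, 10, 12)
```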

The Interquartile Range (IQR)

  • IQR: Distance between Q3 and Q1: [ IQR = Q3 - Q1 ]

Outliers: The 1.5 x IQR Criterion

  • Outlier condition: An observation is an outlier if it lies more than 1.5 x IQR below Q1 or above Q3.

  • Example with IQR = 12, so 1.5 x IQR = 18:

    • Low outlier cutoff: Q1 - 1.5 * IQR = Q1 - 18

    • High outlier cutoff: Q3 + 1.5 * IQR = Q3 + 18
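
A minimal sketch of the 1.5 x IQR criterion. The notes give only IQR = 12, so the quartile values Q1 = 20 and Q3 = 32 below are assumed purely for illustration, as are the candidate observations.

```python
def outlier_cutoffs(q1, q3):
    """Return (low_cutoff, high_cutoff) under the 1.5 x IQR criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    """True if x lies more than 1.5 x IQR below Q1 or above Q3."""
    low, high = outlier_cutoffs(q1, q3)
    return x < low or x > high


# Assumed quartiles (not from the notes) chosen so that IQR = 12.
q1, q3 = 20, 32
print(outlier_cutoffs(q1, q3))  # (2.0, 50.0)

# Hypothetical candidate observations to test against the cutoffs.
candidates = [1, 2, 26, 50, 55]
print([x for x in candidates if is_outlier(x, q1, q3)])  # [1, 55]
```

Values exactly at a cutoff (2 and 50 here) are not flagged, because the criterion requires an observation to lie more than 1.5 x IQR beyond the quartile.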

Page 3: Describing Distributions with Numbers 1.3

Boxplot

  • A boxplot represents the five-number summary and plots outliers individually:

    • Central box spans the quartiles.

    • A line within the box indicates the median.

    • Lines extend from the box to the smallest and largest observations that are not outliers.

The Standard Deviation (S or Sx)

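  • The standard deviation measures the typical distance of observations from their mean. It is the square root of the variance, which is the average squared deviation computed with n - 1 in the denominator: [ S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} ]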

Calculation of the Standard Deviation

  • Example data with mean = 4.8:

| xi | xi - mean | (xi - mean)² |
|------|-------------|----------------|
| 6 | 1.2 | 1.44 |
| 3 | -1.8 | 3.24 |
| 8 | 3.2 | 10.24 |
| 5 | 0.2 | 0.04 |
| 2 | -2.8 | 7.84 |

  • Sum of squared deviations: 22.8. With n = 5 observations, dividing by n - 1 = 4 and taking the square root yields: [ S = \sqrt{\frac{22.8}{4}} = \sqrt{5.7} \approx 2.387 ]
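
The hand calculation above can be reproduced step by step. This sketch uses the same five observations and cross-checks the result against `statistics.stdev` from the Python standard library, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

data = [6, 3, 8, 5, 2]
n = len(data)
x_bar = sum(data) / n                           # mean, 4.8

deviations = [x - x_bar for x in data]          # xi - mean
squared_devs = [d ** 2 for d in deviations]     # (xi - mean)^2
sum_sq = sum(squared_devs)                      # about 22.8

s = sqrt(sum_sq / (n - 1))                      # divide by n - 1, then take the root

print(round(sum_sq, 2), round(s, 3))            # 22.8 2.387
print(abs(s - stdev(data)) < 1e-12)             # True
```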