Chapter 1: Exploring Data
Page 1: Exploring Data 1.1/2: Categorical Variables and Displaying Distributions with Graphs
Individuals and Variables
Individuals: Objects described by a set of data; can be people, animals, or things.
Variables: Characteristics of individuals that can take different values for different individuals.
Categorical and Quantitative Variables
Categorical Variable: Places individuals into groups or categories.
Quantitative Variable: Takes numerical values where arithmetic operations like adding and averaging make sense.
Distribution
The distribution of a variable indicates the values the variable takes and their frequencies.
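As a minimal sketch of what "values and their frequencies" means for a categorical variable, the snippet below tallies counts and relative frequencies. The `colors` list is made-up illustration data, not taken from these notes.

```python
from collections import Counter

# Hypothetical categorical data: favorite color of eight individuals.
colors = ["red", "blue", "red", "green", "blue", "red", "green", "red"]

# Distribution as counts: each value paired with how often it occurs.
counts = Counter(colors)

# Distribution as relative frequencies (proportions that sum to 1).
n = len(colors)
relative = {value: count / n for value, count in counts.items()}

for value, count in counts.most_common():
    print(f"{value}: count={count}, proportion={relative[value]:.3f}")
```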
Describing Overall Pattern of a Distribution – SOCS
Spread: Identify the lowest and highest values in the dataset.
Outliers: Determine if there are any unusual values.
Center: Estimate the approximate average value of the data.
Shape: Assess the graph for symmetry or skewness.
Outliers
An outlier is an observation that falls outside the overall pattern of the distribution.
Describing the SHAPE of a Distribution
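Symmetric Distribution: The left and right sides are roughly mirror images; mean ≈ median.
Skewed Left: The left tail is longer; the mean is usually pulled below the median (mean < median).
Skewed Right: The right tail is longer; the mean is usually pulled above the median (mean > median).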
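The mean and median comparison above can be checked on small data sets. This is only a sketch: the three lists are invented for illustration, and the rule is a typical pattern rather than a guarantee for every data set.

```python
from statistics import mean, median

# Hypothetical data sets chosen to show each shape.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 20]     # long tail toward large values
left_skewed = [1, 17, 18, 18, 19, 19, 20]  # long tail toward small values

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name}: mean={mean(data)}, median={median(data)}")
```

In the right-skewed list the single large value 20 pulls the mean up to 5 while the median stays at 3; the left-skewed list shows the reverse.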
Time Plot
A time plot displays each observation against the time of measurement.
Mark time scale on the horizontal axis and variable of interest on the vertical axis; connecting points helps visualize patterns over time.
Page 2: Describing Distributions with Numbers 1.3
The Mean
To find the mean (x̄) of observations:
Add their values and divide by the number of observations: \[ \bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} \]
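A short sketch of the mean formula, applied to the five observations used in the standard deviation example later in these notes (6, 3, 8, 5, 2). The function name `sample_mean` is an illustrative choice.

```python
def sample_mean(values):
    """Return the mean: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)

observations = [6, 3, 8, 5, 2]
x_bar = sample_mean(observations)
print(x_bar)  # 24 / 5 = 4.8
```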
The Median (M)
The median (M) is the middle value of a distribution once the observations are arranged in order from smallest to largest:
Odd n: Center observation is at position (n + 1) / 2.
Even n: Median is the average of the two center observations (positions n/2 and n/2 + 1).
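The two position rules can be sketched as a small function. Note that the notes count positions starting at 1, while Python lists are indexed from 0, so position (n + 1) / 2 corresponds to index n // 2. The name `median_by_position` is illustrative.

```python
def median_by_position(values):
    """Median via the position rule: sort, then take the center value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the observations at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_position([6, 3, 8, 5, 2]))  # ordered: 2 3 5 6 8 -> 5
print(median_by_position([6, 3, 8, 5]))     # ordered: 3 5 6 8 -> 5.5
```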
The Five-Number Summary
Five-number summary: Minimum, Q1, Median, Q3, Maximum, in order from smallest to largest:
Summary Format: Minimum – Q1 – M – Q3 – Maximum
The Quartiles (Q1 and Q3)
To find quartiles:
Arrange observations in order.
Q1: Median of the observations that lie below the overall median in the ordered list.
Q3: Median of the observations that lie above the overall median in the ordered list.
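The quartile steps above can be sketched as follows, together with the five-number summary they feed into. This follows the rule in these notes (for odd n, the overall median is left out of both halves); statistical software often uses other quartile conventions, so library results may differ slightly. The function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the split-at-median rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]                                   # values below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary([6, 3, 8, 5, 2]))  # (2, 2.5, 5, 7.0, 8)
```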
The Interquartile Range (IQR)
IQR: Distance between Q3 and Q1: \[ \text{IQR} = Q_3 - Q_1 \]
Outliers: The 1.5 x IQR Criterion
Outlier condition: An observation is an outlier if it lies more than 1.5 x IQR below Q1 or above Q3.
Example with IQR = 12, so 1.5 x IQR = 18:
Low outlier cutoff: Q1 - 1.5 x IQR = Q1 - 18
High outlier cutoff: Q3 + 1.5 x IQR = Q3 + 18
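A minimal sketch of the 1.5 x IQR criterion. The notes give only IQR = 12, so the quartiles Q1 = 10 and Q3 = 22 and the data list below are assumed values, picked so that the IQR comes out to 12.

```python
def iqr_outliers(values, q1, q3):
    """Return (low cutoff, high cutoff, outliers) under the 1.5 x IQR criterion."""
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_cutoff or x > high_cutoff]
    return low_cutoff, high_cutoff, outliers

# Hypothetical data whose quartiles (by the rule in these notes) are 10 and 22.
data = [4, 10, 13, 16, 19, 22, 45]
low, high, flagged = iqr_outliers(data, q1=10, q3=22)
print(low, high, flagged)  # -8.0 40.0 [45]
```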
Page 3: Describing Distributions with Numbers 1.3
Boxplot
A boxplot represents the five-number summary and plots outliers individually:
Central box spans the quartiles.
A line within the box indicates the median.
Lines (whiskers) extend from the box to the smallest and largest observations that are not outliers.
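As a sketch of where the whiskers stop when outliers are plotted individually, the function below returns the most extreme observations that still fall inside the 1.5 x IQR cutoffs. The data and quartile values are assumed for illustration, and the function name is hypothetical.

```python
def whisker_ends(values, q1, q3):
    """Return the smallest and largest observations that are not outliers."""
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    inside = [x for x in values if low_cutoff <= x <= high_cutoff]
    return min(inside), max(inside)

# Hypothetical data with Q1 = 10 and Q3 = 18, so the cutoffs are -2 and 30.
data = [1, 10, 12, 14, 16, 18, 40]
print(whisker_ends(data, q1=10, q3=18))  # (1, 18); 40 is drawn as a separate point
```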
The Standard Deviation (S or Sx)
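The standard deviation measures the typical distance of observations from their mean. It is the square root of the variance, which is the average squared deviation computed with n - 1 in the denominator: \[ S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]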
Calculation of the Standard Deviation
Example data with mean = 4.8:

| xi | xi - mean | (xi - mean)² |
|----|-----------|--------------|
| 6 | 1.2 | 1.44 |
| 3 | -1.8 | 3.24 |
| 8 | 3.2 | 10.24 |
| 5 | 0.2 | 0.04 |
| 2 | -2.8 | 7.84 |

Sum of squared deviations: 22.8. With n = 5 observations, n - 1 = 4, so: \[ S = \sqrt{\frac{22.8}{4}} = \sqrt{5.7} \approx 2.387 \]
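The worked calculation can be reproduced step by step, using the same five observations. The sketch below also compares the result with `statistics.stdev` from the Python standard library, which uses the same n - 1 denominator.

```python
import math
import statistics

observations = [6, 3, 8, 5, 2]
n = len(observations)
x_bar = sum(observations) / n                       # mean = 4.8

# Squared deviation of each observation from the mean.
squared_devs = [(x - x_bar) ** 2 for x in observations]
total = sum(squared_devs)                           # about 22.8

variance = total / (n - 1)                          # about 5.7
s = math.sqrt(variance)                             # about 2.387

print(round(total, 2), round(variance, 2), round(s, 3))
print(math.isclose(s, statistics.stdev(observations)))
```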