Notes on Data Exploration for Exam Preparation
Frequency Distributions for Organizing and Summarizing Data
Frequency Distribution: Shows how data are partitioned among several categories (or classes). It lists the categories and the number (frequency) of data values in each category.
Example: Daily Commute Times in Los Angeles. Divide the data into seven classes and compute the frequency of each.
Class: Frequency
0-14: 6
15-29: 18
30-44: 14
45-59: 5
60-74: 5
75-89: 1
90-104: 1
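The tallying step can be sketched in a few lines of Python. This is a minimal illustration only: the raw Los Angeles values are not listed in these notes, so `sample_times` is made-up data, and the helper name `tally_classes` and its parameters are hypothetical. It assumes whole-number, non-negative values and equal class widths of 15, matching the class limits above.

```python
from collections import Counter

def tally_classes(values, class_width=15, lower_limit=0):
    """Count how many values fall in each equal-width class."""
    # Integer division maps each value to the index of its class.
    counts = Counter((v - lower_limit) // class_width for v in values)
    table = {}
    for k in range(max(counts) + 1):
        low = lower_limit + k * class_width
        high = low + class_width - 1
        table[f"{low}-{high}"] = counts.get(k, 0)  # keep empty classes
    return table

# Hypothetical commute times (illustrative only, NOT the LA data)
sample_times = [5, 12, 18, 20, 25, 31, 40, 47, 62, 95]

freq_table = tally_classes(sample_times)
for label, freq in freq_table.items():
    print(f"{label}: {freq}")
```

Empty classes are kept with a frequency of 0 so that a gap in the data stays visible in the table.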
Relative Frequency Distribution
Relative Frequency Distribution: Each class frequency is replaced by a relative frequency (a proportion or percentage).
Calculation: Relative frequency for a class = (frequency for class) / (sum of all frequencies). For example, the 15-29 class has 18 / 50 = 0.36, or 36%.
Example for Los Angeles Commute Times
0-14: 12%
15-29: 36%
30-44: 28%
45-59: 10%
60-74: 10%
75-89: 2%
90-104: 2%
The relative frequencies should sum to 100%; any small discrepancy is due to rounding.
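As a quick check of the calculation, the sketch below recomputes the percentages from the class frequencies listed earlier in these notes. The variable names are arbitrary; only the counts come from the notes.

```python
# Class frequencies copied from the commute-time table in these notes
frequencies = {
    "0-14": 6, "15-29": 18, "30-44": 14, "45-59": 5,
    "60-74": 5, "75-89": 1, "90-104": 1,
}

total = sum(frequencies.values())  # sum of all frequencies

# Relative frequency = class frequency / total, expressed as a percentage
relative = {label: 100 * f / total for label, f in frequencies.items()}

for label, pct in relative.items():
    print(f"{label}: {pct:.0f}%")
print(f"Sum: {sum(relative.values()):.0f}%")
```

The total here is 50, so each data value contributes 2 percentage points, which is why every percentage in the table is even.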
Comparison of Data: Combine two or more relative frequency distributions in one table for easier comparison.
Histograms
Definition: A graph consisting of bars of equal width drawn adjacent to each other; horizontal scale represents classes of data values, while vertical scale represents frequencies.
Uses:
- Visually displays the shape of the data distribution.
- Shows the location of the data's center and its spread.
- Identifies outliers.
Important Characteristics: The shape can indicate a normal, uniform, or skewed distribution.
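To see the shape without a plotting library, the commute-time frequencies from these notes can be drawn as a rough text histogram. This is only a sketch: the bars are turned on their side (classes run down the page instead of along the horizontal scale), and the function name `text_histogram` is hypothetical.

```python
# Class frequencies copied from the commute-time table in these notes
frequencies = {
    "0-14": 6, "15-29": 18, "30-44": 14, "45-59": 5,
    "60-74": 5, "75-89": 1, "90-104": 1,
}

def text_histogram(freqs, symbol="#"):
    """Return one line per class; bar length equals the class frequency."""
    width = max(len(label) for label in freqs)  # align the class labels
    return [f"{label.rjust(width)} | {symbol * f} ({f})"
            for label, f in freqs.items()]

hist_lines = text_histogram(frequencies)
for line in hist_lines:
    print(line)
```

The bars pile up in the lower classes and trail off toward the longer commute times, which is the pattern of a distribution skewed to the right.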
Scatterplots, Correlation, and Regression
Scatterplot: Plot of paired (x, y) quantitative data.
Correlation: A linear correlation exists if the plotted points follow a pattern that can be approximated by a straight line.
- Positive correlation: Both variables increase together.
- Negative correlation: One variable increases as the other decreases.
Correlation Coefficient (r): Measures the strength of the linear association between two variables. It ranges from -1 to 1; a value close to 1 or -1 indicates a strong linear correlation, and a value close to 0 indicates little or no linear correlation.
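These notes do not spell out a formula for r. The sketch below uses the standard Pearson form, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²), as an assumption about which coefficient is meant. The paired data are made up for illustration, and the function name `correlation` is arbitrary.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson linear correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data (illustrative only)
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases with x
falling = [10, 8, 6, 4, 2]   # y decreases as x increases

r_pos = correlation(x, rising)
r_neg = correlation(x, falling)
print(r_pos, r_neg)
```

Both made-up data sets lie exactly on a straight line, so they sit at the two ends of the range for r; real data will usually fall somewhere in between.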
Measures of Center
Mean: The sum of all data values divided by the number of values.
Formula: Mean = (Σxi) / n, where Σxi is the sum of all data values and n is the number of values. The mean is not resistant to outliers; one extreme value can change it substantially.
Median: The middle value of the data when sorted; with an even number of values, it is the mean of the two middle values. Resistant to extreme values.
Mode: The value(s) that occur most frequently in the data set. A data set can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes), or have no mode when no value repeats.
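The difference in resistance is easy to see numerically. The sketch below uses Python's standard `statistics` module on a small made-up data set, then adds one extreme value; the data and variable names are illustrative assumptions only.

```python
import statistics

# Hypothetical data values (illustrative only)
values = [2, 3, 3, 5, 7]
with_outlier = values + [100]   # same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)

# multimode returns every most-frequent value, so it handles bimodal data
modes = statistics.multimode([1, 1, 2, 2, 3])
print("modes: ", modes)
```

With these values, the single extreme entry moves the mean from 4 to 20 while the median only moves from 3 to 4.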
Variability Measures
Range: Difference between maximum and minimum values; sensitive to extremes.
Formula: Range = max value - min value
Standard Deviation (s): Measures the spread of data values from the mean; not resistant to outliers.
Formula: s = sqrt( (Σ(xi - mean)^2) / (n - 1) ), where xi is each data value, mean is the sample mean, and n is the number of values. Dividing by n - 1 makes this the sample standard deviation.
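Both formulas translate directly into code. The sketch below implements the sample standard deviation exactly as written above and cross-checks it against `statistics.stdev` from the standard library; the data values are made up for illustration.

```python
import statistics
from math import sqrt

def sample_std(values):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# Hypothetical data values (illustrative only)
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)   # Range = max value - min value
s = sample_std(values)

print("range:", data_range)
print("s:    ", round(s, 3))
print("check:", round(statistics.stdev(values), 3))  # library equivalent
```

Here the squared deviations from the mean of 5 add up to 32, so s = sqrt(32 / 7), which is about 2.138.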
Percentiles and Quartiles
Percentiles: The 99 values (P1 through P99) that divide sorted data into 100 groups with about 1% of the values in each. Example: a score at the 72nd percentile is above roughly 72% of the scores.
Quartiles: Split sorted data into four groups. Q1: 25th percentile, Q2: 50th percentile (median), Q3: 75th percentile.
Interquartile Range (IQR): Q3 - Q1; measures the spread of the middle 50% of the data.
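A small sketch of these quantities using the standard `statistics` module follows. Two caveats: the scores are made up, and textbooks and software use slightly different conventions for locating quartiles, so `statistics.quantiles` (default "exclusive" method) may not match a hand calculation exactly on every data set. The helper `percent_below` is a hypothetical name.

```python
import statistics

def percent_below(values, x):
    """Percentage of data values strictly less than x."""
    return 100 * sum(v < x for v in values) / len(values)

# Hypothetical exam scores (illustrative only)
scores = [55, 60, 62, 68, 70, 72, 75, 80, 85, 90, 98]

# n=4 returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("percent of scores below 85:", round(percent_below(scores, 85), 1))
```

For these 11 scores the quartiles land on actual data values (62, 72, and 85), and Q2 agrees with the median.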
Boxplots and Outliers
Boxplot: Visual representation of the distribution based on the five-number summary (minimum, Q1, median, Q3, maximum). Useful for identifying skewness and outliers.
Outliers: Values that lie far from most of the other data; they should be investigated rather than discarded automatically, since they may be errors or may provide meaningful insights.
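The five-number summary behind a boxplot can be computed as sketched below. These notes do not give a numeric rule for outliers, so the sketch assumes the common 1.5 × IQR convention (flag values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR). The data are made up, the function names are hypothetical, and quartile conventions vary between textbooks and software.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)

def flag_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data with one extreme value (illustrative only)
times = [5, 12, 18, 20, 25, 31, 40, 47, 62, 95, 300]

summary = five_number_summary(times)
outliers = flag_outliers(times)
print("five-number summary:", summary)
print("flagged outliers:   ", outliers)
```

A flagged value is a prompt to look at the data point more closely, not an instruction to delete it.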
Caution on Data Interpretation
Always consider context and methodology when analyzing measures of center and variability.
Some statistics are not meaningful for certain kinds of data (e.g., a mean computed from categorical data).
Examine graphs carefully for misleading presentations (e.g., a vertical axis that does not start at zero, which exaggerates differences between bars).