Notes on Data Exploration for Exam Preparation

Frequency Distributions for Organizing and Summarizing Data
  • Frequency Distribution: Shows how data are partitioned among several categories (or classes).

    • Lists the categories along with the number (frequency) of data values in each category.

  • Example: Daily Commute Times in Los Angeles. The data are divided into seven classes, and the frequency of each class is counted.

    • Classes (commute time in minutes) and their frequencies:

    • 0-14: 6

    • 15-29: 18

    • 30-44: 14

    • 45-59: 5

    • 60-74: 5

    • 75-89: 1

    • 90-104: 1
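Building such a table from raw values can be sketched in a few lines of Python. The raw commute times are not given in these notes, so the list `times` below is hypothetical, and the helper name `frequency_distribution` is an illustration only. The class width of 15 matches the classes listed above.

```python
from collections import Counter

# Hypothetical commute times in minutes (NOT the data behind the table above).
times = [5, 12, 18, 22, 25, 31, 40, 44, 47, 62, 90]

CLASS_WIDTH = 15  # classes are 0-14, 15-29, 30-44, and so on


def frequency_distribution(values, width):
    """Count how many values fall in each class of the given width."""
    # Map each value to the lower limit of its class, then count.
    counts = Counter((v // width) * width for v in values)
    lowest, highest = min(counts), max(counts)
    # Include empty classes so the table has no gaps.
    return {
        f"{lower}-{lower + width - 1}": counts.get(lower, 0)
        for lower in range(lowest, highest + width, width)
    }


table = frequency_distribution(times, CLASS_WIDTH)
for label, freq in table.items():
    print(f"{label}: {freq}")
```

The frequencies always add up to the number of data values, which is a quick check that no value was dropped or counted twice.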

Relative Frequency Distribution
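  • Relative Frequency Distribution: Each class frequency is replaced by a relative frequency (a proportion or percentage).

    • Calculation: Relative frequency for a class = (frequency for class) / (sum of all frequencies). Multiply by 100 to express it as a percentage.

    • Example for Los Angeles Commute Times (each frequency divided by the total of 50 values):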

    • 0-14: 12%

    • 15-29: 36%

    • 30-44: 28%

    • 45-59: 10%

    • 60-74: 10%

    • 75-89: 2%

    • 90-104: 2%

  • The relative frequencies should sum to 100%; any small discrepancy is due to rounding.

  • Comparison of Data: Combine two or more relative frequency distributions in one table for easier comparison.
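As a check on the percentages above, the calculation can be reproduced from the class frequencies in the commute-time table. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# Class frequencies copied from the commute-time table in these notes.
frequencies = {
    "0-14": 6, "15-29": 18, "30-44": 14, "45-59": 5,
    "60-74": 5, "75-89": 1, "90-104": 1,
}

total = sum(frequencies.values())  # 50 commute times in all

# Relative frequency = class frequency / sum of all frequencies,
# multiplied by 100 to express it as a percentage.
relative = {label: 100 * f / total for label, f in frequencies.items()}

for label, pct in relative.items():
    print(f"{label}: {pct:.0f}%")

print(f"Sum: {sum(relative.values()):.0f}%")
```

Because relative frequencies do not depend on sample size, two data sets of different sizes can be placed side by side in one table and compared directly.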

Histograms
  • Definition: A graph consisting of bars of equal width drawn adjacent to each other; horizontal scale represents classes of data values, while vertical scale represents frequencies.

  • Uses:

    • Visually displays the shape of the data distribution.

    • Shows the location of the center of the data and its spread.

    • Identifies outliers.

  • Important Characteristics:

    • The shape can indicate a normal, uniform, or skewed distribution.
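A rough text version of a histogram is enough to see the shape of a distribution. The sketch below uses the commute-time frequencies from earlier in these notes and draws the bars sideways, one `#` per data value; `text_histogram` is a hypothetical helper, not a library function.

```python
# Class frequencies copied from the commute-time table in these notes.
frequencies = {
    "0-14": 6, "15-29": 18, "30-44": 14, "45-59": 5,
    "60-74": 5, "75-89": 1, "90-104": 1,
}


def text_histogram(freqs):
    """Return one line per class, with one '#' per data value."""
    return [
        f"{label:>6} | {'#' * count} ({count})"
        for label, count in freqs.items()
    ]


for line in text_histogram(frequencies):
    print(line)
```

The bars peak at 15-29 and then trail off toward the longer commute times, which is the pattern of a distribution skewed to the right.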

Scatterplots, Correlation, and Regression
  • Scatterplot: Plot of paired (x, y) quantitative data.

  • Correlation: A linear correlation exists if the plotted points follow a pattern that can be approximated by a straight line.

    • Positive correlation: Both variables increase together.

    • Negative correlation: One variable increases as the other decreases.

  • Correlation Coefficient (r): Measures the strength of the linear association between two variables.

    • Ranges from -1 to 1; values close to 1 or -1 indicate a strong linear correlation, and values near 0 indicate little or no linear correlation.
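The notes do not give a formula for r. The sketch below uses the standard Pearson definition, r = Σ(x - x̄)(y - ȳ) / sqrt( Σ(x - x̄)^2 · Σ(y - ȳ)^2 ), applied to hypothetical paired data (hours studied and exam score). The function name `pearson_r` is chosen here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 78]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")  # close to 1: strong positive linear correlation
```

Points that lie exactly on a rising line give r = 1, and points exactly on a falling line give r = -1.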

Measures of Center
  • Mean: The sum of all data values divided by the number of values.

    • Formula: Mean = (Σxi) / n, where Σxi is the sum of all data values and n is the number of values.

    • Not resistant to outliers; one extreme value can significantly affect the mean.

  • Median: The middle value of data when sorted. Resistant to extreme values.

  • Mode: The value(s) that occur most frequently in the data set; can be unimodal, bimodal, or multimodal.
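The difference between a resistant and a non-resistant measure is easiest to see with numbers. The sketch below uses Python's standard `statistics` module on a small hypothetical data set, computed once with and once without its extreme value.

```python
import statistics

# Hypothetical data set with one extreme value (90).
values = [2, 3, 3, 5, 7, 90]
without_extreme = values[:-1]  # same data with the 90 removed

mean_all = statistics.mean(values)        # 110 / 6, about 18.33
median_all = statistics.median(values)    # average of 3 and 5 = 4.0
modes_all = statistics.multimode(values)  # [3], so the data are unimodal

mean_trimmed = statistics.mean(without_extreme)      # 20 / 5 = 4
median_trimmed = statistics.median(without_extreme)  # middle value = 3

print(f"mean:   {mean_trimmed} -> {mean_all:.2f}")
print(f"median: {median_trimmed} -> {median_all}")
print(f"mode(s): {modes_all}")
```

Adding the single value 90 moves the mean from 4 to about 18.33, while the median only moves from 3 to 4.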

Variability Measures
  • Range: Difference between maximum and minimum values; sensitive to extremes.

    • Formula: Range = max value - min value

  • Standard Deviation (s): Measures the spread of data values from the mean; not resistant to outliers.

    • Formula (sample standard deviation): s = sqrt( (Σ(xi - mean)^2) / (n - 1) ), where xi is each data value, mean is the sample mean, and n is the number of values.
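Both formulas can be applied step by step to a small data set. The values below are hypothetical, and `sample_std` is a hand-written helper that follows the formula for s with n - 1 in the denominator; the standard library's `statistics.stdev` is shown only as a cross-check.

```python
import math
import statistics

# Hypothetical data values.
values = [2, 4, 4, 6, 9]

# Range = max value - min value
data_range = max(values) - min(values)  # 9 - 2 = 7


def sample_std(xs):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_deviations / (n - 1))


s = sample_std(values)  # mean is 5, squared deviations sum to 28, s = sqrt(28 / 4)

print(f"range = {data_range}")
print(f"s = {s:.4f}")
print(f"statistics.stdev agrees: {math.isclose(s, statistics.stdev(values))}")
```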

Percentiles and Quartiles
  • Percentiles: Values that divide sorted data into 100 groups, each containing about 1% of the values. Example: A score at the 72nd percentile is higher than about 72% of the scores.

  • Quartiles: Split sorted data into four groups, each containing about 25% of the values. Q1: 25th percentile, Q2: 50th percentile (median), Q3: 75th percentile.

    • Interquartile Range (IQR): Q3 - Q1; measures the spread of the middle 50% of the data.
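Quartiles and the IQR can be computed with `statistics.quantiles` from Python's standard library. The scores below are hypothetical. Textbooks and software use slightly different conventions for locating quartiles, so values found by hand may differ a little from this sketch, which relies on the function's default method.

```python
import statistics

# Hypothetical exam scores (illustration only, not from the notes).
scores = [55, 60, 62, 68, 70, 73, 75, 80, 84, 90, 95]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
```

For this data set Q2 coincides with `statistics.median(scores)`, as expected for the 50th percentile.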

Boxplots and Outliers
  • Boxplot: A graph of the five-number summary (minimum, Q1, median, Q3, maximum).

    • Useful for identifying skewness and outliers in the data.

    • Outliers: Values that lie far from almost all of the other values; they should be investigated because they may provide meaningful insights.
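The five-number summary behind a boxplot can be computed directly. These notes do not say how far away a value must be to count as an outlier; the sketch below assumes one common convention, which flags values more than 1.5 × IQR below Q1 or above Q3. The scores are hypothetical.

```python
import statistics

# Hypothetical scores with one unusually large value (150).
scores = [55, 60, 62, 68, 70, 73, 75, 80, 84, 90, 150]

q1, median, q3 = statistics.quantiles(scores, n=4)
five_number_summary = (min(scores), q1, median, q3, max(scores))

# Assumed convention (not stated in the notes): values more than
# 1.5 * IQR beyond the quartiles are flagged as potential outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(f"five-number summary: {five_number_summary}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"potential outliers: {outliers}")
```

A flagged value is a candidate for investigation, not automatically an error.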

Caution on Data Interpretation
  • Always consider context and methodology when analyzing measures of center and variability.

  • Some statistics are not meaningful for certain kinds of data (e.g., a mean computed from categorical data).

  • Examine graphs carefully for misleading presentations (e.g., a vertical axis that does not start at zero, which exaggerates differences).