Measures of Variance, Quartiles, and Box Plots

Data File Preparation

  • The session uses the same movie data file as previous sessions.
  • Initial Editing: To focus on relevant data, the following columns are deleted:
    • Box Office data (Column H).
    • Year data (Column G).

Measures of Variance: Range and Interquartile Range (IQR)

Revisiting Range
  • Definition: The range is the difference between the minimum and maximum values in a dataset.
  • Limitations: Simple range is not a good measure of variance because it is highly affected by outliers.
Introducing Quartiles
  • Concept Connection: Building on the idea of the median, which represents the 50th50^{th} percentile (half the data is smaller, half is larger).
  • Percentiles and Quartiles:
    • First Quartile (Q1Q1): The value at the 25th25^{th} percentile; the lowest quarter of the data.
    • Second Quartile (Q2Q2): The median; the value at the 50th50^{th} percentile.
    • Third Quartile (Q3Q3): The value at the 75th75^{th} percentile.
    • Fourth Quartile (Q4Q4): The maximum value; the value at the 100th100^{th} percentile.
Calculating Min, Max, and Quartiles in Excel
  • Minimum Value: Use the formula =MIN(data)=MIN(data). For Metacritic data (Column F): =MIN(F:F)=MIN(F:F). Result: 1313.
  • Maximum Value: Use the formula =MAX(data)=MAX(data). For Metacritic data (Column F): =MAX(F:F)=MAX(F:F). Result: 9090.
  • Quartiles: Use the formula =QUARTILE.INC(array,quart)=QUARTILE.INC(array, quart).
    • Important Note for K300: Always use QUARTILE.INCQUARTILE.INC (inclusive version) when calculating quartiles in Excel. The difference between .INC and .EXE is minor but consistency is key.
    • array: The range of data, e.g., F:FF:F for Metacritic data.
    • quart: A number indicating which quartile to return:
      • 00: Minimum value (e.g., =QUARTILE.INC(F:F,0)=QUARTILE.INC(F:F, 0) gives 1313).
      • 11: First Quartile (25th25^{th} percentile). =QUARTILE.INC(F:F,1)=QUARTILE.INC(F:F, 1). Result: 4040.
      • 22: Median (50th50^{th} percentile).
      • 33: Third Quartile (75th75^{th} percentile). =QUARTILE.INC(F:F,3)=QUARTILE.INC(F:F, 3). Result: 6464.
      • 44: Maximum value (e.g., =QUARTILE.INC(F:F,4)=QUARTILE.INC(F:F, 4) gives 9090).
Interquartile Range (IQR)
  • Definition: The IQR is the difference between the third quartile and the first quartile: IQR=Q3Q1IQR = Q3 - Q1.
  • Calculation for Metacritic: 6440=2464 - 40 = 24.
  • Significance: The IQR is a more robust measure of variability compared to the simple range because it is less affected by outliers. It focuses on the spread of the middle 50th50^{th} of the data, specifically between the 25th25^{th} and 75th75^{th} percentiles, making it a "less fragile" version of range.

Box and Whisker Plots (Box Plots)

Creating a Box Plot in Excel
  1. Select Data: Highlight the column(s) of data you want to plot (e.g., Metacritic data, Column F).
  2. Insert Chart: Go to Insert > Charts section > Statistical Charts > Box and Whisker.
Interpreting a Box Plot
  • The Box:
    • Bottom of the Box: Represents the First Quartile (Q1Q1) (e.g., 4040 for Metacritic).
    • Top of the Box: Represents the Third Quartile (Q3Q3) (e.g., 6464 for Metacritic).
    • Height of the Box: The Interquartile Range (IQR), indicating the spread of the middle 50th50^{th} of the data.
    • Line Inside the Box: Represents the Median (Q2Q2 or 50th50^{th} percentile). This shows the central tendency of the data.
  • Other Elements:
    • 'x' Mark: Represents the Mean of the data. This provides another measure of central tendency.
    • Whiskers: In Excel's box plots, these do not necessarily extend to the minimum and maximum values. They typically extend to 1.5imesIQR1.5 imes IQR beyond the quartiles. For K300, focus on the box; the precise whisker calculation is less critical than understanding the box's meaning.
  • Insights from Box Plots:
    • Variability: The height of the box (IQR) gives an immediate visual sense of data spread. Taller boxes indicate more variability.
    • Central Tendency: The median line and the mean 'x' show the typical value.
    • Distribution Shape (Skewness): The relationship between the mean and median can indicate skewness.
      • If the mean and median are close (like in the Metacritic example, suggesting symmetric data), the distribution is likely symmetric.
      • If the mean ('x') is noticeably above the median, the distribution is positively skewed (tail in the positive direction).
      • If the mean ('x') is noticeably below the median, the distribution is negatively skewed (tail in the negative direction).
Creating and Interpreting Multi-Series Box Plots
  1. Select Multiple Columns: Highlight all columns of data you want to compare (e.g., RT Critic, RT Fan, IMDb, Metacritic).
  2. Insert Box Plot: Follow the same steps as above.
  3. Add a Legend: To identify which box corresponds to which dataset, click on the chart, then Add Chart Element > Legend.
  4. Comparing Datasets: Multi-series box plots allow for visual comparison of:
    • Variability: Taller boxes (larger IQR) indicate greater variability. For example, RT Critic had a taller bar, indicating more variability than others.
    • Central Tendency: Compare the median lines and mean 'x' marks across different plots to see which datasets have higher or lower typical values.
Example: Comparing Variability with Standard Deviation
  • Standard Deviation Calculations:
    • Metacritic (Column F): =STDEV.P(F:F)=STDEV.P(F:F)
    • RT Critic (Column C): =STDEV.P(C:C)=STDEV.P(C:C). Result: 2626
    • RT Fan (Column D): =STDEV.P(D:D)=STDEV.P(D:D)
    • IMDb (Column E): =STDEV.P(E:E)=STDEV.P(E:E)
  • Observation: The RT Critic data had a standard deviation of 2626, which was higher than other datasets. Correspondingly, its box plot visually showed a taller box, reinforcing the idea that a larger standard deviation (or IQR) means more variability in the data. This direct correlation between numerical measures of variance and their visual representation in box plots is crucial for understanding data spread.