EW

Data Science, Week 3: Introduction

Overview of Mean and Median

  • Mean: The average of a set of numbers, calculated by dividing their sum by how many numbers there are.

  • Median: Middle value when a data set is ordered. If there’s an even number of observations, average the two middle values.

  • Both are measures of central tendency, but they summarize the center in different ways.

Finding Mean and Median

  • Finding Mean: Sum all values (for example, everyone's age) and divide by the number of observations in the data set.

  • Finding Median:

    • For an odd number of observations: Order the data and take the middle value.

    • For an even number of observations: Order the data and average the two middle values.

    • Example: With 5 people, the median is the 3rd person's value in the ordered list. With 6 people, average the values of the 3rd and 4th persons.
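The steps above can be sketched in Python using only the standard library. The ages are made-up illustration values, and the helper names `mean_by_hand` and `median_by_hand` are hypothetical; `statistics.median` is used only as a cross-check.

```python
import statistics

# Hypothetical ages, invented for illustration.
ages_odd = [31, 22, 45, 27, 38]        # 5 people
ages_even = [31, 22, 45, 27, 38, 52]   # 6 people


def mean_by_hand(values):
    """Sum the values and divide by how many there are."""
    return sum(values) / len(values)


def median_by_hand(values):
    """Order the data, then take the middle value (odd count)
    or the average of the two middle values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # e.g. 3rd of 5
    return (ordered[mid - 1] + ordered[mid]) / 2   # e.g. 3rd and 4th of 6


print(mean_by_hand(ages_odd))      # 32.6
print(median_by_hand(ages_odd))    # 31
print(median_by_hand(ages_even))   # 34.5

# Cross-check against the standard library.
print(statistics.median(ages_even) == median_by_hand(ages_even))  # True
```

Note that the data must be sorted first: the 3rd value of the unsorted list is 45, which is not the median.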

Importance of Data Ordering

  • Understanding rank (position) in ordered data is critical for locating the median.

  • On a histogram, the median splits the data into two equal halves: 50% of observations fall below it and 50% above.

  • Note: Unlike the mean, the median is not the balancing point of the distribution.
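A small Python sketch of the difference between "splitting point" and "balancing point", assuming made-up, right-skewed values. Deviations from the mean sum to zero (it balances the data), while the median balances the counts on either side.

```python
import statistics

# Hypothetical right-skewed values, invented for illustration.
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # 9
median = statistics.median(values)  # 4

# Deviations from the mean cancel out: the "balancing point" property.
sum_dev_mean = sum(v - mean for v in values)      # 0
# Deviations from the median generally do not cancel.
sum_dev_median = sum(v - median for v in values)  # 35

# The median balances counts instead: as many values below as above.
below = sum(1 for v in values if v < median)  # 3
above = sum(1 for v in values if v > median)  # 3

print(mean, median, sum_dev_mean, sum_dev_median, below, above)
```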

Reporting Mean vs. Median

  • Mean: Sensitive to extreme values/outliers, so a few very high or very low observations can pull it away from the typical value.

  • Median: More robust against outliers, so it better describes central tendency in skewed data.

  • When to Use: A typical buyer is better served by the median property price, while agents may prefer the mean because a few high sales pull it upward.
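To illustrate the property-pricing point, here is a minimal Python sketch with invented sale prices (in thousands), where one luxury sale sits far above the rest.

```python
import statistics

# Hypothetical sale prices in thousands, invented for illustration.
prices = [310, 325, 340, 355, 370, 1500]

mean_price = statistics.mean(prices)      # about 533.3
median_price = statistics.median(prices)  # 347.5

# How many homes sold for less than the mean price?
below_mean = sum(1 for p in prices if p < mean_price)  # 5 of 6

print(round(mean_price, 1), median_price, below_mean)
```

Here five of the six homes sold for less than the mean, so the median is closer to what a typical buyer would actually pay.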

Box Plot Analysis

  • In box plots, the median is marked, but you often need to plot the mean separately.

  • Robustness of Median: When data are skewed, the median is more representative of a typical value than the mean.

  • Outliers: Identified using IQR (upper threshold = Q3 + 1.5 * IQR; lower threshold = Q1 - 1.5 * IQR).
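The IQR rule above can be sketched in Python as follows. The data are invented, and `method="inclusive"` is just one quartile convention; different tools compute quartiles slightly differently, so exact thresholds may vary.

```python
import statistics

# Hypothetical values, invented for illustration.
data = [12, 15, 17, 18, 20, 21, 23, 24, 26, 60]

# Quartiles Q1, Q2 (median), Q3 using the "inclusive" convention.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                  # 23.75 - 17.25 = 6.5

lower = q1 - 1.5 * iqr         # 7.5
upper = q3 + 1.5 * iqr         # 33.5

# Anything outside [lower, upper] is flagged as an outlier.
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, iqr, lower, upper, outliers)  # 17.25 23.75 6.5 7.5 33.5 [60]
```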

Variability Measures

  • Standard Deviation (SD): Indicates spread around the mean; sensitive to outliers.

  • Interquartile Range (IQR): Difference between Q3 and Q1; less sensitive to outliers and indicates middle 50% spread.

  • Coefficient of Variation (CV): Measures relative variability; calculated as SD divided by the mean. Useful for comparing data sets with different units or very different means.
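A minimal Python sketch of the three measures, assuming invented heights and weights. The helper `spread_summary` is a hypothetical name. `statistics.stdev` is the sample SD (n - 1 denominator), and the quartile method is an assumed convention.

```python
import statistics

# Hypothetical measurements in different units, invented for illustration.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 70, 78, 85]


def spread_summary(values):
    """Return SD, IQR, and CV for a list of numbers."""
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    cv = sd / statistics.mean(values)  # unitless relative spread
    return {"sd": sd, "iqr": q3 - q1, "cv": cv}


h = spread_summary(heights_cm)
w = spread_summary(weights_kg)

# SDs in cm and kg are not directly comparable; CVs are.
print(round(h["sd"], 2), h["iqr"], round(h["cv"], 3))  # 7.91 10.0 0.047
print(round(w["sd"], 2), w["iqr"], round(w["cv"], 3))  # 12.02 16.0 0.172
```

In this made-up sample the weights vary more relative to their mean than the heights do, which the raw SDs alone could not show because the units differ.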

Adding or Removing Data Points

  • Impact of outliers: Adjusting extreme data points affects the mean significantly more than the median.

  • Report findings with both mean (SD) and median (IQR) for comprehensive analysis.
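As a rough Python illustration of both points, the sketch below formats "mean (SD)" and "median (IQR)" before and after adding one extreme value. The data and the helper name `report` are invented for illustration.

```python
import statistics


def report(values):
    """Format 'mean (SD)' and 'median (IQR)' for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (
        f"mean (SD) = {statistics.mean(values):.1f} "
        f"({statistics.stdev(values):.1f}); "
        f"median (IQR) = {statistics.median(values):.1f} ({q3 - q1:.1f})"
    )


# Hypothetical data, invented for illustration.
base = [20, 22, 24, 26, 28]
with_outlier = base + [90]

print(report(base))          # mean (SD) = 24.0 (3.2); median (IQR) = 24.0 (4.0)
print(report(with_outlier))  # mean (SD) = 35.0 (27.1); median (IQR) = 25.0 (5.0)
```

One added extreme value moves the mean by 11 and the SD by roughly a factor of eight, while the median and IQR each move by 1.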

Data Scaling

  • Effect on Mean vs. Standard Deviation:

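    • Shifting data (adding a constant c to every value) changes the mean by c but leaves the SD unchanged.

    • Scaling data (multiplying every value by a factor k) multiplies the mean by k and the SD by |k|.

  • This matters when data change systematically, for example when market factors move all prices by a fixed amount or a fixed percentage.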
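The effect of shifting and scaling can be checked with a short Python sketch. The prices, the shift of 50, and the 10% increase are assumptions for illustration only.

```python
import statistics

# Hypothetical prices in thousands, invented for illustration.
prices = [200, 220, 250, 270, 310]

shifted = [p + 50 for p in prices]   # every price rises by a fixed amount
scaled = [p * 1.1 for p in prices]   # every price rises by 10%

for label, data in [("original", prices), ("shifted", shifted), ("scaled", scaled)]:
    print(label, round(statistics.mean(data), 2), round(statistics.stdev(data), 2))

# original 250 43.01
# shifted  300 43.01   (mean moves by 50, SD unchanged)
# scaled   275.0 47.31 (mean and SD both multiplied by 1.1)
```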

Practicing Data Analysis in R

  • Use RStudio for calculating mean, median, SD, and IQR:

    • Functions: mean(), median(), sd(), IQR(), quantile(), etc.

  • Check variable types (quantitative vs. qualitative) before analyzing a dataset; the mean and SD are only meaningful for quantitative variables.

Pairing Measures of Central Tendency and Spread

  • Pair the median with the IQR (both robust to outliers) and the mean with the SD (both sensitive to outliers) so that center and spread are interpreted consistently.

  • Always report both center and spread together for a clear understanding of the data distribution.