Data Science Week 3: Introduction
Overview of Mean and Median
Mean: Average of a set of numbers calculated by dividing the sum of the numbers by the quantity of numbers.
Median: Middle value when a data set is ordered. If there’s an even number of observations, average the two middle values.
Both are measures of central tendency, but they summarize the center of the data in different ways.
Finding Mean and Median
Finding Mean: Sum all ages and divide by the number of people in the data set.
Finding Median:
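For an odd number of observations: Order the data and take the middle value.
For an even number of observations: Order the data and average the two middle values.
Example: If there are 5 people, the median is the value for the 3rd person in the ordered list. For 6 people, take the values for the 3rd and 4th persons in the ordered list, then average them.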
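The steps above can be sketched in R. The ages below are made-up values used only for illustration; they do not come from the course data set.

```r
# Hypothetical ages for five people (illustrative values only)
ages_odd <- c(34, 21, 45, 28, 52)

# Mean: sum of the values divided by how many there are
mean_odd <- sum(ages_odd) / length(ages_odd)   # same result as mean(ages_odd)

# Median with an odd count: order the data, take the middle (3rd of 5) value
sorted_odd <- sort(ages_odd)
median_odd <- sorted_odd[3]

# Add a sixth (hypothetical) person so the count is even
ages_even <- c(ages_odd, 40)
sorted_even <- sort(ages_even)

# Median with an even count: average the 3rd and 4th ordered values
median_even <- (sorted_even[3] + sorted_even[4]) / 2

print(c(mean = mean_odd, median_5 = median_odd, median_6 = median_even))
```

The manual results should agree with the built-in median() on the same vectors.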
Importance of Data Ordering
Understanding the rank (position) of each value in the ordered data is critical for finding the median.
In a histogram, the median value splits data into two equal parts with 50% below and 50% above it.
Note: Unlike the mean, the median is not the balancing point of the distribution.
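A small sketch of the distinction, using a made-up right-skewed vector: deviations from the mean cancel out (the balancing point), while the median only guarantees an equal count of values on each side.

```r
# Hypothetical right-skewed data (illustrative values only)
x <- c(1, 2, 3, 4, 20)

m   <- mean(x)
med <- median(x)

# Deviations from the mean sum to zero: the mean is the balancing point
dev_from_mean <- sum(x - m)

# The median splits the observations by count, not by distance
n_below <- sum(x < med)
n_above <- sum(x > med)

# Deviations from the median do not balance in skewed data
dev_from_median <- sum(x - med)

print(c(mean = m, median = med,
        dev_from_mean = dev_from_mean, dev_from_median = dev_from_median))
```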
Reporting Mean vs. Median
Mean: Sensitive to extreme values (outliers), so a few very high or very low observations can pull it away from the typical value.
Median: More robust against outliers, better for understanding central tendency in skewed data.
When to Use: A typical buyer is better served by the median property price, which reflects what a typical home sells for, while agents may prefer the mean, which a few high-value sales pull upward.
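To make the property example concrete, here is a minimal sketch with invented sale prices (in thousands); the numbers are assumptions chosen only to show one expensive sale pulling the mean upward.

```r
# Hypothetical sale prices in thousands (illustrative values only)
prices <- c(300, 320, 340, 360, 380, 2500)

mean_price   <- mean(prices)     # pulled upward by the single 2500 sale
median_price <- median(prices)   # stays near what a typical home sold for

# How many homes actually sold below the mean?
n_below_mean <- sum(prices < mean_price)

print(c(mean = mean_price, median = median_price, below_mean = n_below_mean))
```

In this made-up set, five of the six homes sold for less than the mean, which is why the median is the more useful figure for a typical buyer.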
Box Plot Analysis
In box plots, the median is marked, but you often need to plot the mean separately.
Robustness of Median: The median is more representative of a typical value when the data are skewed.
Outliers: Identified using the IQR rule: values above the upper threshold (Q3 + 1.5 * IQR) or below the lower threshold (Q1 - 1.5 * IQR) are flagged as outliers.
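The IQR rule above can be sketched as follows. The data are made up, and the quartiles come from quantile() with its default method; boxplot() computes its box edges from hinges, which can differ slightly from these quartiles for some sample sizes.

```r
# Hypothetical data with one extreme value (illustrative values only)
x <- c(2, 4, 5, 6, 7, 8, 9, 11, 30)

# Quartiles using quantile()'s default method
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1                  # same as IQR(x)

# Thresholds from the 1.5 * IQR rule
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values outside the thresholds are flagged as outliers
outliers <- x[x < lower | x > upper]

print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower, upper = upper))
print(outliers)
```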
Variability Measures
Standard Deviation (SD): Indicates spread around the mean; sensitive to outliers.
Interquartile Range (IQR): Difference between Q3 and Q1; less sensitive to outliers and indicates middle 50% spread.
Coefficient of Variation (CV): Measures relative variability; calculated as SD divided by the mean, useful for comparing data sets of different units or means.
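As a rough illustration of why the CV helps when units or means differ, the sketch below uses two invented variables with the same SD but different means. The helper cv() is a hypothetical name defined here; it is not a base R function.

```r
# Hypothetical measurements (illustrative values only)
heights_cm <- c(150, 160, 170, 180, 190)
weights_kg <- c(50, 60, 70, 80, 90)

# Hypothetical helper: coefficient of variation = SD / mean
cv <- function(x) sd(x) / mean(x)

# Both variables have the same SD and IQR in their own units...
sd_heights <- sd(heights_cm)
sd_weights <- sd(weights_kg)
iqr_heights <- IQR(heights_cm)

# ...but relative to their means, the weights vary more
cv_heights <- cv(heights_cm)
cv_weights <- cv(weights_kg)

print(c(sd_heights = sd_heights, sd_weights = sd_weights,
        cv_heights = cv_heights, cv_weights = cv_weights))
```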
Adding or Removing Data Points
Impact of outliers: Adjusting extreme data points affects the mean significantly more than the median.
Report findings with both mean (SD) and median (IQR) for comprehensive analysis.
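One possible way to produce such a report is sketched below. The function name summarize_center_spread() and the data are hypothetical; the point is to compare the two summaries before and after an extreme value is added.

```r
# Hypothetical helper that formats both pairs of summaries in one line
summarize_center_spread <- function(x) {
  sprintf("mean (SD) = %.1f (%.1f); median (IQR) = %.1f (%.1f)",
          mean(x), sd(x), median(x), IQR(x))
}

# Hypothetical data, then the same data with one extreme point added
base_data    <- c(10, 12, 13, 15, 16)
with_outlier <- c(base_data, 100)

# How far each measure of center moves when the extreme point is added
mean_shift   <- abs(mean(with_outlier) - mean(base_data))
median_shift <- abs(median(with_outlier) - median(base_data))

print(summarize_center_spread(base_data))
print(summarize_center_spread(with_outlier))
print(c(mean_shift = mean_shift, median_shift = median_shift))
```

Reading the two printed lines side by side shows the mean and SD changing much more than the median and IQR.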
Data Scaling
Effect on Mean vs. Standard Deviation:
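Shifting data (adding a constant value) changes the mean by that constant but leaves the SD unchanged.
Scaling data (multiplying by a factor) multiplies both the mean and the SD by that factor (the SD by its absolute value).
This matters when analyzing trends in which market factors change every value at once, for example by a fixed amount (a shift) or by a percentage (a scale).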
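A quick check of the shifting and scaling rules, using made-up prices, a shift of 10, and a scale factor of 1.1 (all values are assumptions for illustration):

```r
# Hypothetical prices (illustrative values only)
x <- c(100, 120, 150, 180, 200)

shifted <- x + 10    # add a constant to every value
scaled  <- x * 1.1   # multiply every value by a factor

results <- rbind(
  original = c(mean = mean(x),       sd = sd(x)),
  shifted  = c(mean = mean(shifted), sd = sd(shifted)),
  scaled   = c(mean = mean(scaled),  sd = sd(scaled))
)
print(results)
```

The shifted row should show a mean 10 higher with the same SD, and the scaled row should show both values multiplied by 1.1.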
Practicing Data Analysis in R
Use RStudio to calculate the mean, median, SD, and IQR:
Functions: mean(), median(), sd(), IQR(), quantile(), etc.
Check variable types (quantitative vs. qualitative) before analyzing a dataset, because these summaries are meaningful only for quantitative variables.
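A minimal sketch of checking variable types before summarizing. The data frame and its column names (age, city) are invented for illustration.

```r
# Hypothetical dataset with one quantitative and one qualitative variable
df <- data.frame(
  age  = c(23, 35, 41, 29),
  city = c("A", "B", "A", "C")
)

# Inspect the type of every column
str(df)

# Identify which columns are quantitative (numeric)
numeric_cols <- sapply(df, is.numeric)

# Summarize only the numeric columns with mean, median, SD, and IQR
numeric_summary <- sapply(df[numeric_cols], function(col) {
  c(mean = mean(col), median = median(col), sd = sd(col), IQR = IQR(col))
})
print(numeric_summary)

# Qualitative variables are summarized with counts instead
city_counts <- table(df$city)
print(city_counts)
```

Calling mean() on the city column directly would return NA with a warning, which is one reason to check types first.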
Pairing Measures of Central Tendency and Spread
Pair the median with the IQR (both robust to outliers) and the mean with the SD (both sensitive to outliers) so that center and spread are interpreted consistently.
Always report both center and spread together for a clear understanding of the data distribution.