Basic Statistical Concepts
Chapter Two: Basic Statistical Concepts
Overview of Statistical Concepts
Many students feel apprehensive about statistics.
This chapter focuses on basic statistical ideas: descriptive statistics and its three components, namely frequency, central tendency, and variability.
Descriptive Statistics
Descriptive statistics is a method used to summarize the data collected from experiments, making it easier to understand.
Components discussed include:
Frequency
Central tendency
Variability
Frequency
Frequency refers to how often a particular observation appears in the data.
Representation:
Generally represented using a histogram (a bar chart of frequencies) or a frequency polygon (the same frequencies drawn as a line graph). Both convey the same information.
Axes:
Y-axis: Frequency of observations (how many times an observation occurred).
X-axis: Variable of interest (e.g., test scores).
Example:
If ten students take a standardized test, a histogram might show the following distribution of scores:
1 person scored 350.
2 people scored 400.
3 people scored 450.
4 people scored 500.
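The frequency counts in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names `scores` and `frequency` are illustrative, and the text histogram stands in for a real plot.

```python
from collections import Counter

# Scores from the example above: one 350, two 400s, three 450s, four 500s.
scores = [350, 400, 400, 450, 450, 450, 500, 500, 500, 500]

# Counter maps each distinct score (x-axis) to how often it occurs (y-axis).
frequency = Counter(scores)

# Crude text histogram: one '#' per observation.
for score in sorted(frequency):
    print(f"{score}: {'#' * frequency[score]}")
```

Each printed row corresponds to one bar of the histogram, with the bar's length equal to the frequency of that score.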
Types of Distributions
Many traits are approximately normally distributed. A normal distribution is symmetric, meaning:
If the graph is folded in half at its center, the two sides are mirror images of each other.
Example: The height of individuals is likely to follow a normal distribution, with most individuals near the average height and progressively fewer at the extremes (very short or very tall).
Skewed Distributions:
Negatively skewed distribution:
Characterized by data being clustered on the right side of the graph (e.g., a very easy exam where most scores are high).
Key feature: Tail points to the left (negative skew).
Positively skewed distribution:
Characterized by data concentrated on the left side of the graph (e.g., the time taken to complete a very easy exam, where most students finish quickly and a few take much longer).
Key feature: Tail points to the right (positive skew).
Recognizing Skewness
To determine skewness:
Place an imaginary arrowhead at the tail of the distribution:
For negative skew, arrow points left (toward negative numbers).
For positive skew, arrow points right (toward positive numbers).
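The arrowhead rule can also be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5), which is one common definition and not something these notes prescribe. The two data sets are made up for illustration.

```python
from statistics import fmean

def skewness(data):
    """Moment-based skewness: negative means a left tail, positive a right tail."""
    center = fmean(data)
    m2 = fmean([(x - center) ** 2 for x in data])  # second central moment
    m3 = fmean([(x - center) ** 3 for x in data])  # third central moment
    return m3 / m2 ** 1.5

def skew_direction(data, tolerance=1e-9):
    """Label the direction in which the tail (the 'arrowhead') points."""
    g = skewness(data)
    if g < -tolerance:
        return "negative (tail points left)"
    if g > tolerance:
        return "positive (tail points right)"
    return "symmetric"

# Hypothetical data: scores on a very easy exam, and minutes taken to finish it.
easy_exam_scores = [55, 80, 85, 90, 90, 95, 95, 95, 100, 100]
completion_minutes = [5, 5, 6, 6, 6, 7, 7, 8, 10, 25]

print(skew_direction(easy_exam_scores))    # negative (tail points left)
print(skew_direction(completion_minutes))  # positive (tail points right)
```

The sign of the coefficient matches the direction of the arrow: a few unusually low scores pull the tail left, and a few unusually slow students pull it right.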
Central Tendency
Central tendency provides a measure of the center of the data set. It can be measured in three primary ways:
Mean: Average of all numbers in a data set.
Median: The middle value when the numbers are arranged in ascending order.
Mode: The value that appears most frequently.
Central Tendency in Normal Distribution
In a normal distribution, the mean, median, and mode are all equal:
Example:
If income were normally distributed with an average of $30,000:
Mode = $30,000
Median = $30,000
Mean = $30,000
Central Tendency in Skewed Distribution
In skewed distributions, mean, median, and mode differ:
Example for Positively Skewed Distribution:
Mode < Median < Mean
If extreme values are added, they disproportionately raise the mean while having less effect on the median.
Mathematical Representation:
For the set {1, 2, 3, 4, 5}:
Mean = 3,
Median = 3,
Mode: none, because every value occurs exactly once.
When adding an extreme observation (e.g., 100),
New Mean = 115 / 6 ≈ 19.2, while the new Median = 3.5, so Mean > Median.
Results:
The mode is unaffected by a single extreme value, because it depends only on how often each value occurs, not on how large the values are.
The median provides a more accurate reflection of a 'typical' value in the case of skewed distributions due to less distortion from extreme values.
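The effect of an extreme value on each measure can be verified with Python's standard `statistics` module. This is a minimal sketch: the first data set is the one used above, while `repeated` is a hypothetical set added only so that a mode exists.

```python
from statistics import mean, median, mode

data = [1, 2, 3, 4, 5]
with_outlier = data + [100]

# Mean and median coincide for the symmetric set.
print(mean(data), median(data))

# The outlier drags the mean up to about 19.17; the median only moves to 3.5.
print(mean(with_outlier), median(with_outlier))

# Hypothetical set with a repeated value, to show the mode is unaffected.
repeated = [1, 2, 2, 3, 4]
print(mode(repeated), mode(repeated + [100]))
```

The mean moves by more than 16 points, the median by half a point, and the mode not at all, which is why the median is usually preferred as the "typical" value for skewed data.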
Variability
Variability measures how much the data points differ from each other.
Variation can be observed even with the same measure of central tendency among different groups:
Example: Two groups with an average score of 15:
One group tightly clustered around 15 (low variability).
Another group more spread out from 15 (high variability).
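A small numerical sketch makes the contrast concrete. The two groups below are invented for illustration. Both average 15, and their spread is compared using the range and the population standard deviation (`statistics.pstdev`), the measure introduced in the next section.

```python
from statistics import mean, pstdev

# Hypothetical scores: both groups have a mean of 15.
tight_group = [14, 15, 15, 15, 16]
spread_group = [5, 10, 15, 20, 25]

for name, group in [("tight", tight_group), ("spread", spread_group)]:
    value_range = max(group) - min(group)  # simplest measure of spread
    sd = pstdev(group)                     # population standard deviation
    print(f"{name}: mean={mean(group)}, range={value_range}, sd={sd:.2f}")
```

The means are identical, but the second group's range and standard deviation are roughly ten times larger, so the mean alone does not describe the data.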
Standard Deviation
One common measure of variability is standard deviation:
It represents the typical distance of observations from the mean.
Graphical Representation:
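In a normal distribution, the first standard deviation above the mean and the first standard deviation below it each contain approximately 34% of observations, about 68% in total.
Each additional standard deviation away from the mean contains a smaller share of observations: roughly 13.5% on each side between one and two standard deviations, and about 2% on each side between two and three.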
Real-world Application:
Assuming scores are roughly normally distributed, for an exam with a mean score of 75% and a standard deviation of 10 percentage points, about 68% of students scored between:
65% (75% - 10%)
85% (75% + 10%).
Conversely, if the standard deviation were only 5 percentage points, about 68% of students would score between:
70% (75% - 5%)
80% (75% + 5%).
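These ranges can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch assumes exam scores are exactly normally distributed, which real scores only approximate. The helper name `central_range` is illustrative.

```python
from statistics import NormalDist

def central_range(mean, sd, k=1):
    """Return (low, high, share) for the band within k standard deviations of the mean."""
    scores = NormalDist(mu=mean, sigma=sd)
    low, high = mean - k * sd, mean + k * sd
    share = scores.cdf(high) - scores.cdf(low)  # proportion of scores in [low, high]
    return low, high, share

for sd in (10, 5):
    low, high, share = central_range(75, sd)
    print(f"sd={sd}: {share:.0%} of scores fall between {low}% and {high}%")
```

The share is about 68% in both cases. Only the width of the band changes, from 20 points wide to 10 points wide.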
Implications of Standard Deviation:
A smaller standard deviation indicates less variability and a clustering of scores around the mean; larger standard deviations indicate more spread and variability among scores.