Notes on Bar Graphs, Histograms, Scatter Plots, Data Collection, Correlation, Distribution Shapes, and Statistical Significance

Bar graphs vs histograms

Bar graph: bars with gaps between categories; used for discrete categories or grouped data.
Histogram: no gaps between bars; displays frequency distribution for continuous data; essentially showing how often values occur within bins.
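Both convey frequency information, but they are not truly interchangeable: a bar graph compares counts across separate categories, while a histogram shows the shape of a distribution, and its appearance depends on how the bins are chosen.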
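As a rough illustration of the difference, the sketch below counts categories directly (what a bar graph would plot) and groups numeric values into bins first (what a histogram would plot). The scores, the truck list, and the helper name `histogram_counts` are made up for the example.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical test scores (numeric data): grouped into bins of width 10.
scores = [52, 58, 61, 67, 68, 71, 73, 74, 79, 85, 88, 93]
score_bins = histogram_counts(scores, 10)

# Hypothetical truck brands (categories): counted as-is, no binning needed.
trucks = ["Ford", "Toyota", "Ford", "Ram", "Chevy", "Toyota", "Ford"]
truck_counts = dict(Counter(trucks))

print(score_bins)    # keys are the lower edge of each bin
print(truck_counts)
```

Changing `bin_width` changes the picture a histogram gives, which is one reason bin choice deserves attention.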
Important caution: these charts can be manipulated to mislead the audience.
- Example given: reliability ratings for trucks (Ford, Toyota, Chevy, Ram).
- Trick: fine print can say data are based on major repairs after ten years, which skews the appearance of reliability.
- If Ford shows 96% reliability and Toyota 94% after ten years, a naïve reading may claim Ford is clearly more reliable by 2%; in context, the practical difference may be minor.
- Lesson: read the fine print and consider what the percentages are actually measuring; be cautious with marketing-driven data presentation.

Scatter plots and the regression-to-the-mean concept

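Scatter plots: plot paired values of two variables as points. The direction of the overall trend shows whether the relationship is positive or negative, and how tightly the points cluster around that trend shows how strong it is.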
Relationship directions:
- Positive correlation: as one variable increases, the other tends to increase.
- Negative correlation: as one variable increases, the other tends to decrease.
- No/weak correlation: difficult to discern a clear positive or negative relationship (near zero correlation).
Regression toward the mean:
- When a measurement is extreme on the first occasion, the subsequent measurement tends to be closer to the average.
- Example: a basketball player averages 30 points per game; in the first game of a season, scores 60 points; by season end, scores are expected to be closer to their average rather than remaining at 60.
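Regression toward the mean can be seen in a small simulation. This is only a sketch under stated assumptions: each game score is modeled as the player's long-run average (30) plus random noise, the noise level of 8 points is invented, and "extreme" is taken to mean a first game of 45 or more so that enough cases are available.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_AVERAGE = 30        # long-run scoring average from the example
NOISE_SD = 8             # hypothetical game-to-game variation


def game_score():
    """One simulated game: the true average plus random noise."""
    return rng.gauss(TRUE_AVERAGE, NOISE_SD)


# Simulate many (first game, next game) pairs for the same kind of player.
pairs = [(game_score(), game_score()) for _ in range(100_000)]

# Keep only the pairs where the first game was unusually high.
extreme = [(first, second) for first, second in pairs if first >= 45]

mean_first = sum(first for first, _ in extreme) / len(extreme)
mean_second = sum(second for _, second in extreme) / len(extreme)

print("average extreme first game:", round(mean_first, 1))
print("average following game:   ", round(mean_second, 1))
```

In this model the following game averages close to 30 even though the first game was selected for being far above it; nothing about the player changed, only the luck.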
Data collection can be qualitative or quantitative (or both):
- Qualitative data: descriptive information from interviews, observations, case studies.
- Quantitative data: numerical data (e.g., Likert scales). Example provided, a Chipotle survey question: "Will you come back to Chipotle in the next week?" answered on a 5-point scale (1 = strongly disagree, 5 = strongly agree).
Likert scale example highlights:
- Quantitative data (scale 1–5) is cheap and easy to collect but may lack depth.
- Qualitative questions (e.g., "What do you like about Chipotle?") provide richer detail but are more labor-intensive.
Scatter plot examples:
- Positive trend: two variables move upward together.
- Negative trend (example given verbally): more drug use goes with lower grades.
- No clear trend: data scattered with no obvious pattern.

Correlation concepts and the correlation coefficient

Correlation coefficient measures the strength and direction of a linear relationship between two variables.
Value range: $r \,\in\,[-1,1]$
- r ≈ +1: strong positive linear relationship.
- r ≈ -1: strong negative linear relationship.
- r ≈ 0: little to no linear relationship.
Key interpretation caveat: a higher absolute value of r indicates a stronger linear relationship, not necessarily causation.
Examples:
- Positive correlation example: more studying, better grades (both variables increase together).
- Negative correlation example: more drug use, lower grades (as one increases, the other decreases).
Reminder: correlation does not imply causation; r tells about linear association, not about cause-and-effect.
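To make the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand from paired data. The study hours and grades are invented numbers, and `pearson_r` is just an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours studied and the grade earned.
hours = [1, 2, 3, 4, 5]
grades = [62, 70, 71, 80, 88]

r_positive = pearson_r(hours, grades)
# Flipping the sign of one variable flips the direction of the relationship.
r_negative = pearson_r(hours, [-g for g in grades])

print(round(r_positive, 2), round(r_negative, 2))
```

With these numbers r comes out close to +1 (roughly 0.98), a strong positive linear association. It still says nothing about whether studying caused the grades.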

Distribution shapes, skewness, and related ideas

Skewness: describes where the bulk of data lies relative to the tails
- Positive skew: bulk of data on the left, tail to the right.
- Negative skew: bulk of data on the right, tail to the left.
Bell curve (normal distribution): classic reference point for many natural phenomena.
IQ distribution as discussed in the session:
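- The teacher described a bell-curve-like view: about 70% of IQs fall within 15 points of the mean (mean = 100), roughly between 85 and 115. The usual normal-curve figure for this range is about 68%.
- Also noted: about 90% of IQs fall within 30 points of the mean, roughly between 70 and 130. For a normal curve with a standard deviation of 15, the figure is closer to 95%.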
- The speaker noted these are approximations in a casual discussion.
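The classroom approximations can be checked against an idealized normal curve. The sketch below assumes IQ is normally distributed with mean 100 and standard deviation 15; the standard deviation is the conventional value for IQ scales, not something stated in the session.

```python
from statistics import NormalDist

# Assumed model: IQ ~ Normal(mean = 100, standard deviation = 15).
iq = NormalDist(mu=100, sigma=15)

# Share of the population between 85 and 115 (within 15 points of the mean).
within_15_points = iq.cdf(115) - iq.cdf(85)

# Share of the population between 70 and 130 (within 30 points of the mean).
within_30_points = iq.cdf(130) - iq.cdf(70)

print(f"85 to 115: {within_15_points:.1%}")
print(f"70 to 130: {within_30_points:.1%}")
```

Under this model the shares are about 68% and about 95%, so the "70%" figure is a reasonable rounding while the "90%" figure is on the low side.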
Bimodal distribution: two peaks in the data set.
Aside: a humorous mention of Ron McDonald tied to the bell curve; included as a classroom anecdote rather than a methodological point.
Practical takeaway: distribution shape affects interpretation of central tendency and variability.

Central tendency measures

Central tendency concepts describe typical values in a distribution:
- Mode: the most frequently occurring value (the peak(s) in the distribution).
- Mean: the arithmetic average; $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
- Median: the middle value when data are ordered; divides the dataset into two halves.
The session used a narrative example around test scores:
- If 66 students take a test, the mode might be the score that occurs most often (e.g., many students scoring 16/16 on a vocab test), and the mean might be around 98–99% if most students perform well.
- The mode is often higher than the mean when a large portion of students is well-prepared and there are a few low outliers pulling the mean down.
The median is the middle score, with half of the scores above it and half below. The class discussed where exactly it falls for a given class size: with an even number of scores (such as 66), it is the average of the two middle scores.
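The three measures can be compared on a small, made-up set of vocab-test scores (out of 16) in which most students do well and a few score low. The numbers are hypothetical and chosen only to show the pattern described above.

```python
from statistics import mean, median, mode

# Hypothetical vocab-test scores out of 16: mostly high, with two low outliers.
vocab_scores = [16, 16, 16, 16, 15, 15, 14, 13, 9, 4]

score_mode = mode(vocab_scores)      # most frequent score
score_median = median(vocab_scores)  # middle of the sorted scores
score_mean = mean(vocab_scores)      # arithmetic average

print("mode:", score_mode)
print("median:", score_median)
print("mean:", score_mean)
```

Here the mode (16) is above the median (15), which is above the mean (13.4): the low outliers pull the mean down more than they move the median.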
Range context:
- $\text{range} = \max(x_i) - \min(x_i)$.
- Example: scores from 6% to 96% yield a range of 90 percentage points.
- Because the range depends only on the two most extreme scores, the highest and lowest scores are sometimes excluded to get a clearer sense of typical performance.
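A short sketch of the range, with and without the two extreme scores. Only the endpoints 6 and 96 come from the example above; the scores in between are invented.

```python
# Hypothetical percentage scores; only the 6 and the 96 come from the notes.
scores = [6, 55, 62, 70, 74, 81, 88, 96]

full_range = max(scores) - min(scores)

# Drop the single lowest and single highest score, then measure again.
trimmed = sorted(scores)[1:-1]
trimmed_range = max(trimmed) - min(trimmed)

print("full range:", full_range)
print("range without the extremes:", trimmed_range)
```

With this data the range drops from 90 to 33 percentage points once the two extremes are removed, which shows how much a single unusual score can drive the range.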

Spread, variability, and the standard deviation concept

Range describes the overall spread (from max to min), but it is sensitive to outliers.
Standard deviation measures how spread out data are around the mean; lower standard deviation indicates more consistency.
Illustrative basketball example:
- Player A: scores 50, 10, 12, 52, 14 (high variability).
- Player B: scores 35, 28, 32, 30, 31 (more consistent).
- If you’re building a team, you would prefer the player with the lower standard deviation (more reliable scoring).
Everyday application for test design: a test writer aims for a relatively low standard deviation among test scores, allowing for a predictable distribution with a few outliers.
Formulas:
- Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
- Sample standard deviation: $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$
Interpretation: a low standard deviation signals that most scores cluster around the mean; a high standard deviation signals wide dispersion.
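Applying both formulas to the two basketball players above makes the contrast concrete. This sketch uses Python's standard `statistics` module, where `pstdev` is the population formula (divide by N) and `stdev` is the sample formula (divide by N - 1).

```python
from statistics import mean, pstdev, stdev

player_a = [50, 10, 12, 52, 14]  # scores from the example (high variability)
player_b = [35, 28, 32, 30, 31]  # scores from the example (more consistent)

mean_a, mean_b = mean(player_a), mean(player_b)
pop_sd_a, pop_sd_b = pstdev(player_a), pstdev(player_b)      # divide by N
sample_sd_a, sample_sd_b = stdev(player_a), stdev(player_b)  # divide by N - 1

print(f"Player A: mean {mean_a:.1f}, sigma {pop_sd_a:.2f}, s {sample_sd_a:.2f}")
print(f"Player B: mean {mean_b:.1f}, sigma {pop_sd_b:.2f}, s {sample_sd_b:.2f}")
```

Player A's standard deviation comes out at roughly 19 points and Player B's at roughly 2 points, while their means (27.6 and 31.2) are fairly close. The averages alone would hide how different the two players are.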

Percentiles and standardized testing context

Percentile definition: a percentile indicates the value below which a given percentage of observations fall.
Example interpretation: a student scoring in the 90th percentile scores higher than 90% of peers who took the same test.
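One simple way to compute a percentile rank is to count the share of scores strictly below a given score. This is a sketch of that one definition (others exist, for example counting ties as half); the scores and the helper name `percentile_rank` are made up.

```python
def percentile_rank(scores, value):
    """Percentage of scores that are strictly below the given value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)


# Hypothetical scores for ten students on the same test.
test_scores = [12, 15, 18, 20, 21, 23, 25, 27, 29, 30]

top_rank = percentile_rank(test_scores, 30)
bottom_rank = percentile_rank(test_scores, 12)

print("score of 30:", top_rank)
print("score of 12:", bottom_rank)
```

The student with 30 scored higher than 9 of the 10 students, so under this definition that score sits at the 90th percentile.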
Standardized testing context mentioned:
- Percentiles and testing standards have historically been used (e.g., ACT, SAT) to rank or assess performance.
- There was commentary about shifts in the usage and perceived role of these tests in college admissions.

Statistical significance, peer review, and replication

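Experimental results are evaluated for statistical significance: a result is called statistically significant when it would be unlikely to arise from random chance alone if there were no real effect. This is commonly judged with a p-value below a chosen threshold such as 0.05.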
If confounding variables are minimized or eliminated, the study is closer to a pure experiment.
Key validity concepts:
- Peer review: other experts read and critique the study or test to assess its quality and validity.
- Replicability (replication): another class/school can repeat the study with similar procedures and obtain similar results, demonstrating reliability across settings.
The speaker referenced prior content about confounding (not "compounding") variables and emphasized the importance of valid, verifiable research practices.
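The idea behind statistical significance can be sketched with a permutation test: if the group labels did not matter, shuffling them should often produce a difference as large as the one observed. Everything below is illustrative. The two groups of scores are invented, and this is one of several ways to test significance, not a method named in the session.

```python
import random


def mean_diff(a, b):
    """Difference between the two group averages."""
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Share of label shuffles whose difference is at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean_diff(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean_diff(shuffled_a, shuffled_b)) >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical test scores for two classes taught with different methods.
class_a = [78, 82, 85, 88, 90, 91, 84, 87]
class_b = [70, 72, 75, 68, 74, 77, 71, 73]

p_value = permutation_p_value(class_a, class_b)
print("estimated p-value:", p_value)
```

A small p-value here means random relabeling almost never reproduces a gap this large, so chance alone is an unlikely explanation. It does not rule out confounding variables, which is why peer review and replication still matter.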

Practical takeaways and context

Data literacy takeaway: always consider how data are collected and presented (qualitative vs quantitative, scale type, and how bins/categories are defined).
Be wary of marketing-driven data presentations and check the underlying definitions and time frames.
Understand that correlation does not imply causation; use the correlation coefficient to gauge linear association, not to claim cause and effect.
Recognize that outliers influence measures of central tendency and spread, and decisions about including or excluding them can change interpretations.
In testing and assessment contexts, aim for a balanced distribution with reasonable central tendency and low to moderate variability to ensure fair measurements across populations.

Quick recap of key symbols and terms

Central tendencies: mode, mean ($\bar{x}$), median
Range: $\text{range} = \max(x_i) - \min(x_i)$
Correlation: $r = \frac{\mathrm{cov}(X,Y)}{\sigma_X\,\sigma_Y}, \quad r \in [-1,1]$
Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$ (population) and $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$ (sample)
Percentiles: value below which a given percentage of observations fall
Distributions: skewness (positive/negative), normal (bell curve), bimodal (two peaks)
Significance concepts: statistical significance, peer review, replication