Notes on Bar Graphs, Histograms, Scatter Plots, Data Collection, Correlation, Distribution Shapes, and Statistical Significance

Bar graphs vs histograms

  • Bar graph: bars with gaps between categories; used for discrete categories or grouped data.

  • Histogram: no gaps between bars; displays frequency distribution for continuous data; essentially showing how often values occur within bins.

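  • They can convey similar frequency information when bins/categories are chosen appropriately, but they are not strictly interchangeable: a histogram's horizontal axis is a continuous numeric scale, while a bar graph's categories have no inherent order or spacing.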

  • Important caution: these charts can be manipulated to mislead the audience.

    • Example given: reliability ratings for trucks (Ford, Toyota, Chevy, Ram).

    • Trick: fine print can say data are based on major repairs after ten years, which skews the appearance of reliability.

    • If Ford shows 96% reliability and Toyota 94% after ten years, a naïve reading may claim Ford is clearly more reliable; in context, a gap of 2 percentage points may be of little practical importance.

    • Lesson: read the fine print and consider what the percentages are actually measuring; be cautious with marketing-driven data presentation.
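
To make the idea of "how often values occur within bins" concrete, here is a minimal sketch of the counting step behind a histogram. The scores and the bin width of 10 are hypothetical, chosen only for illustration, and the helper name `histogram_counts` is not from the session.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical integer test scores (illustrative data, not from the session).
scores = [52, 58, 61, 64, 67, 71, 73, 75, 78, 79, 82, 85, 88, 91, 97]
bins = histogram_counts(scores, bin_width=10)

# Text rendering: adjacent bins with no gaps, one '#' per observation.
for start, count in bins.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```

Changing `bin_width` changes the shape of the picture without changing the data, which is one way such a chart can be made to look more or less dramatic.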

Scatter plots and the regression-to-the-mean concept

  • Scatter plots: plot paired values of two variables as points; the slope of the overall trend shows the direction of the relationship, and how tightly the points cluster around that trend shows its strength.

  • Relationship directions:

    • Positive correlation: as one variable increases, the other tends to increase.

    • Negative correlation: as one variable increases, the other tends to decrease.

    • No/weak correlation: difficult to discern a clear positive or negative relationship (near zero correlation).

  • Regression toward the mean:

    • When a measurement is extreme on the first occasion, the subsequent measurement tends to be closer to the average.

    • Example: a basketball player who averages 30 points per game scores 60 in the first game of a season; later games are expected to fall closer to the 30-point average rather than staying at 60.

  • Data collection can be qualitative or quantitative (or both):

    • Qualitative data: descriptive information from interviews, observations, case studies.

    • Quantitative data: numerical data (e.g., Likert scales). Example provided: a Chipotle survey item, "Will you come back to Chipotle in the next week?", answered on a 5-point scale (1 = strongly disagree, 5 = strongly agree).

  • Likert scale example highlights:

    • Quantitative data (scale 1–5) is cheap and easy to collect but may lack depth.

    • Qualitative questions (e.g., "What do you like about Chipotle?") provide richer detail but are more labor-intensive.

  • Scatter plot examples:

    • Positive trend: two variables move upward together.

    • Negative trend example verbally given: more drugs used correlates with lower grades.

    • No clear trend: data scattered with no obvious pattern.
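
Regression toward the mean, described earlier in this section, can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model from the session: each player is given a stable skill level (mean 30, spread 5) and each game adds independent luck (spread 8). All of these numbers are hypothetical.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed model: score = stable skill + independent game-to-game luck.
skills = [random.gauss(30, 5) for _ in range(10_000)]
game1 = [skill + random.gauss(0, 8) for skill in skills]
game2 = [skill + random.gauss(0, 8) for skill in skills]

# Select the players whose first game was in the top 10%.
cutoff = sorted(game1)[int(0.9 * len(game1))]
top = [i for i, score in enumerate(game1) if score >= cutoff]

overall = mean(game1)
top_game1 = mean([game1[i] for i in top])
top_game2 = mean([game2[i] for i in top])

print(f"overall average:         {overall:.1f}")
print(f"top group, first game:   {top_game1:.1f}")
print(f"top group, second game:  {top_game2:.1f}")
```

The top group still scores above average in the second game (they are genuinely skilled), but noticeably lower than in the first, because part of an extreme first score was luck that does not repeat.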

Correlation concepts and the correlation coefficient

  • Correlation coefficient measures the strength and direction of a linear relationship between two variables.

  • Value range: \( r \in [-1, 1] \)

    • r ≈ +1: strong positive linear relationship.

    • r ≈ -1: strong negative linear relationship.

    • r ≈ 0: little to no linear relationship.

  • Key interpretation caveat: a higher absolute value of r indicates a stronger linear relationship, not necessarily causation.

  • Examples:

    • Positive correlation example: more studying, better grades (both variables increase together).

    • Negative correlation example: more drugs, lower grades (as one increases, the other decreases).

  • Reminder: correlation does not imply causation; r tells about linear association, not about cause-and-effect.
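
As a rough illustration of how the correlation coefficient is computed, the sketch below implements the standard Pearson formula (covariance divided by the product of the standard deviations). The study-hours and grade values are hypothetical, made up to mirror the "more studying, better grades" example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. grade (illustrative only).
hours = [1, 2, 3, 4, 5]
grades = [62, 70, 74, 83, 90]

r_positive = pearson_r(hours, grades)
r_negative = pearson_r(hours, list(reversed(grades)))
print(f"r (studying vs. grades): {r_positive:.3f}")
print(f"r (reversed grades):     {r_negative:.3f}")
```

A value near +1 or -1 here says only that the points lie close to a straight line; it says nothing about which variable, if either, causes the other.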

Distribution shapes, skewness, and related ideas

  • Skewness: describes where the bulk of data lies relative to the tails

    • Positive skew: bulk of data on the left, tail to the right.

    • Negative skew: bulk of data on the right, tail to the left.

  • Bell curve (normal distribution): classic reference point for many natural phenomena.

  • IQ distribution as discussed in the session:

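    • The teacher described a bell-curve-like view: about 70% of IQs fall within 15 points of the mean (mean = 100), roughly between 85 and 115. For a normal distribution with a standard deviation of 15, the figure is closer to 68%.

    • Also noted: about 90% of IQs fall within 30 points of the mean, roughly between 70 and 130. Under the same normal model, the figure is closer to 95%.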

    • The speaker noted these are approximations in a casual discussion.

  • Bimodal distribution: two peaks in the data set.

  • Aside: a humorous mention of Ron McDonald, tied to the bell curve; included as a classroom anecdote rather than a methodological point.

  • Practical takeaway: distribution shape affects interpretation of central tendency and variability.
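
The IQ figures discussed above can be checked against an idealized bell curve. The sketch below assumes IQ is exactly normal with mean 100 and standard deviation 15; that model is an assumption for illustration, and real score distributions only approximate it.

```python
from statistics import NormalDist

# Assumed model: IQ ~ Normal(mean=100, sd=15).
iq = NormalDist(mu=100, sigma=15)

# Share of the distribution between two scores = difference of CDF values.
within_15 = iq.cdf(115) - iq.cdf(85)   # within one standard deviation
within_30 = iq.cdf(130) - iq.cdf(70)   # within two standard deviations

print(f"between 85 and 115: {within_15:.1%}")
print(f"between 70 and 130: {within_30:.1%}")
```

Under this model the shares come out near 68% and 95%, which is the usual rule of thumb for one and two standard deviations.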

Central tendency measures

  • Central tendency concepts describe typical values in a distribution:

    • Mode: the most frequently occurring value (the peak(s) in the distribution).

    • Mean: the arithmetic average; \( \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \).

    • Median: the middle value when data are ordered; divides the dataset into two halves.

  • The session used a narrative example around test scores:

    • If 66 students take a test, the mode might be the score that occurs most often (e.g., many students scoring 16/16 on a vocab test), and the mean might be around 98–99% if most students perform well.

    • The mode is often higher than the mean when a large portion of students is well-prepared and there are a few low outliers pulling the mean down.

  • Median is the middle score, with half above and half below; example discussed with questions of exact placement for a class size (illustrative purposes).

  • Range context:

    • \( \text{range} = \max(x_i) - \min(x_i) \).

    • Example: scores from 6% to 96% yield a range of 90 percentage points.

    • It is common to exclude the highest and lowest scores to get a clearer sense of typical performance.
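
The measures in this section can be computed directly with Python's standard `statistics` module. The scores below are hypothetical results on a 16-point vocab test, chosen so that most students do well and one low outlier pulls the mean down.

```python
from statistics import mean, median, mode

# Hypothetical scores out of 16 (illustrative only).
scores = [16, 16, 16, 16, 15, 15, 14, 6]

score_mode = mode(scores)                 # most frequent value
score_mean = mean(scores)                 # arithmetic average
score_median = median(scores)             # middle of the ordered data
score_range = max(scores) - min(scores)   # max minus min

# Range after dropping the single highest and lowest scores.
trimmed = sorted(scores)[1:-1]
trimmed_range = max(trimmed) - min(trimmed)

print(score_mode, score_mean, score_median, score_range, trimmed_range)
```

Here the mode sits above the mean because the single score of 6 drags the average down, and dropping the extremes shrinks the range considerably.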

Spread, variability, and the standard deviation concept

  • Range describes the overall spread (from max to min), but it is sensitive to outliers.

  • Standard deviation measures how spread out data are around the mean; lower standard deviation indicates more consistency.

  • Illustrative basketball example:

    • Player A: scores 50, 10, 12, 52, 14 (high variability).

    • Player B: scores 35, 28, 32, 30, 31 (more consistent).

    • If you’re building a team, you would prefer the player with the lower standard deviation (more reliable scoring).

  • Everyday application for test design: a test writer aims for a relatively low standard deviation among test scores, allowing for a predictable distribution with a few outliers.

  • Formulas:

    • Population standard deviation: \( \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \)

    • Sample standard deviation: \( s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2} \)

  • Interpretation: a low standard deviation signals that most scores cluster around the mean; a high standard deviation signals wide dispersion.
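
Applying the two formulas to the basketball example above gives a sense of scale. This sketch uses the standard library's `pstdev` (population, divides by N) and `stdev` (sample, divides by N − 1); the scores are the ones listed for Player A and Player B.

```python
from statistics import mean, pstdev, stdev

player_a = [50, 10, 12, 52, 14]   # high variability
player_b = [35, 28, 32, 30, 31]   # more consistent

for name, scores in (("A", player_a), ("B", player_b)):
    print(
        f"Player {name}: mean={mean(scores):.1f}, "
        f"population sd={pstdev(scores):.2f}, "
        f"sample sd={stdev(scores):.2f}"
    )
```

Player B has both the higher average and the far smaller standard deviation. The sample version is always slightly larger than the population version for the same data because it divides by N − 1.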

Percentiles and standardized testing context

  • Percentile definition: a percentile indicates the value below which a given percentage of observations fall.

  • Example interpretation: a student scoring in the 90th percentile scores higher than 90% of peers who took the same test.

  • Standardized testing context mentioned:

    • Percentiles and testing standards have historically been used (e.g., ACT, SAT) to rank or assess performance.

    • There was commentary about shifts in the usage and perceived role of these tests in college admissions.
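
One simple way to compute a percentile rank is to count the share of scores strictly below a given score. Other conventions exist (for example, counting ties as half), so the definition used here is an assumption, and the class scores are hypothetical.

```python
def percentile_rank(scores, score):
    """Percentage of scores strictly below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical class of ten test-takers (illustrative only).
class_scores = [55, 60, 62, 68, 70, 74, 77, 81, 88, 95]

top_rank = percentile_rank(class_scores, 95)
mid_rank = percentile_rank(class_scores, 74)
print(f"score 95 -> {top_rank:.0f}th percentile")
print(f"score 74 -> {mid_rank:.0f}th percentile")
```

A percentile rank describes position relative to the other test-takers, not the percentage of questions answered correctly.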

Statistical significance, peer review, and replication

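  • Experimental results are evaluated for statistical significance: how unlikely the observed results would be if only random chance, and no real effect, were at work. The usual measure is a p-value, and a small p-value means chance alone is an implausible explanation.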

  • If confounding variables are minimized or eliminated, the study is closer to a pure experiment.

  • Key validity concepts:

    • Peer review: other experts read and critique the study or test to assess its quality and validity.

    • Replicability (replication): another class/school can reproduce the study with similar procedures and obtain close scores, demonstrating reliability across settings.

  • The speaker referenced prior content about confounding (compounding) variables and emphasized the importance of valid, verifiable research practices.
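
One way to make "could this be chance?" concrete is a permutation test, a technique not covered in the session and used here only as an illustration. The sketch asks: if group labels were assigned at random, how often would the difference in group means be at least as large as the one observed? The two groups of scores are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """One-sided exact permutation test for mean(group_a) > mean(group_b)."""
    pooled = group_a + group_b
    observed = mean(group_a) - mean(group_b)
    indices = range(len(pooled))
    at_least_as_large = 0
    total = 0
    # Try every way of relabeling the pooled scores into two groups.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point rounding.
        if mean(a) - mean(b) >= observed - 1e-12:
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total

# Hypothetical scores: a class that used a new study method vs. one that did not.
new_method = [88, 92, 85, 91, 89]
old_method = [78, 82, 80, 85, 79]

p_value = permutation_p_value(new_method, old_method)
print(f"p-value: {p_value:.4f}")
```

A small p-value says a gap this large would rarely arise from random relabeling alone. It does not rule out confounding variables, which is why peer review and replication still matter.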

Practical takeaways and context

  • Data literacy takeaway: always consider how data are collected and presented (qualitative vs quantitative, scale type, and how bins/categories are defined).

  • Be wary of marketing-driven data presentations and check the underlying definitions and time frames.

  • Understand that correlation does not imply causation; use the correlation coefficient to gauge linear association, not to claim cause and effect.

  • Recognize that outliers influence measures of central tendency and spread, and decisions about including or excluding them can change interpretations.

  • In testing and assessment contexts, aim for a balanced distribution with reasonable central tendency and low to moderate variability to ensure fair measurements across populations.

Quick recap of key symbols and terms

  • Central tendencies: mode, mean \( \bar{x} \), median

  • Range: \( \text{range} = \max(x_i) - \min(x_i) \)

  • Correlation: \( r = \frac{\mathrm{cov}(X,Y)}{\sigma_X\,\sigma_Y}, \quad r \in [-1,1] \)

  • Standard deviation: \( \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \) (population) and \( s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2} \) (sample)

  • Percentiles: value below which a given percentage of observations fall

  • Distributions: skewness (positive/negative), normal (bell curve), bimodal (two peaks)

  • Significance concepts: statistical significance, peer review, replication