
Statistics Refresher – Key Vocabulary

Introduction: Why Statistics & Testing Matter

  • Test scores follow individuals from early schooling to job applications, profoundly impacting academic placements, career opportunities, and clinical diagnoses.

  • Positive results can open doors to advanced opportunities and interventions; negative scores can create barriers or misdirect resources.

  • Psychologists, teachers, employers, and researchers must possess strong statistical literacy to accurately interpret, skillfully create, and continually refine tests and assessment tools. This ensures fairness, validity, and utility in decision-making.

Measurement Basics

  • Measurement = the systematic process of assigning numbers or symbols to characteristics or attributes of objects, people, or events according to predefined rules (Stevens, 1946). These rules ensure consistency and meaning.

  • Scale: an ordered set of symbols, often numbers, designed to model the empirical properties of the attribute being measured.

  • Sample space: encompasses all possible specific values that a variable can theoretically take.

    • Gender sample space: {male, female, non-binary, or other specified gender identities}.

    • Age: typically natural integers for years \{0, 1, 2, \ldots\} or positive real numbers for precise measurement in years or fractions thereof.

    • Height: represented by positive real numbers in [0, +\infty), acknowledging that height cannot be negative.

  • Discrete variables: possess countable sample spaces, meaning values can only be whole numbers or distinct categories with inherent gaps between them (e.g., number of specific behaviors observed, count of hospitalizations, number of correct answers on a multiple-choice test).

  • Continuous variables: can take on any real value within a given range; fractions, decimals, and irrational numbers are possible, indicating a continuous scale without gaps (e.g., weight, time, temperature, reaction speed). They are limited only by the precision of the measuring instrument.

  • Rounding continuous measures must scrupulously match the precision of the measurement instrument to avoid introducing false accuracy or misrepresenting the data (e.g., reporting weight to the nearest gram when the scale only measures to the nearest kilogram).
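  • A minimal sketch of this rounding rule in Python (the instrument step value is an assumption for illustration):

```python
# Report a continuous measure only to the precision the instrument
# supports. Assumes a scale that reads to the nearest 0.1 kg.
def report_weight(raw_kg: float, instrument_step_kg: float = 0.1) -> float:
    """Round a raw reading to the instrument's smallest step."""
    steps = round(raw_kg / instrument_step_kg)
    return steps * instrument_step_kg

print(report_weight(72.4638))  # 72.5 -- not 72.4638, which implies false precision
```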

Error in Measurement

  • Everyday “error” implies a “mistake” or blunder; however, in psychological and educational testing, error refers to any variance in an observed score that is not attributable to the true score of the construct being measured. It is inherent and unavoidable.

  • Sources of error are multifaceted:

    • Environmental factors: external conditions like noise (e.g., thunderstorm during a test), temperature fluctuations, or poor lighting that affect examinee performance.

    • Item sampling: the specific items chosen for a test may not perfectly represent the entire domain of content being assessed, leading to measurement variability.

    • Examinee state: internal factors such as fatigue, anxiety, motivation levels, or health status at the time of testing can influence performance.

    • Administrator-related factors: inconsistent instructions, deviations from standardized procedures, or even nonverbal cues from the test administrator.

    • Scoring errors: subjective scoring or computational mistakes, particularly in open-ended or complex assessments.

  • Continuous scales inherently provide an approximation of the “real” or true value; therefore, every observed score (X) is conceptualized as the sum of a true score (T) and an error component (E): X = T + E, where E may be positive or negative (see the sketch after this list).

  • Test developers are ethically and professionally obligated to rigorously write standardized administration instructions and train administrators to minimize the impact of error sources (e.g., site preparation, strict timing protocols, control of potential distractions, clear and consistent verbal instructions).
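  • A minimal simulation of the X = T + E idea (all numbers invented for illustration): error fluctuates around zero, so the mean of many observed scores approaches the true score.

```python
# Simulate classical test theory's X = T + E with normally distributed error.
import random

random.seed(42)
TRUE_SCORE = 100          # hypothetical true score T
ERROR_SD = 5              # hypothetical spread of the error component E

observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(1000)]

# Error averages out: the mean of many observed scores approaches T.
mean_observed = sum(observed) / len(observed)
print(f"mean observed score = {mean_observed:.2f} (true score = {TRUE_SCORE})")
```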

Levels of Measurement – The NOIR Framework

  • The Nominal, Ordinal, Interval, Ratio framework (developed by Stevens, 1946) classifies data based on the nature of the information they convey and the mathematical properties they possess. These levels ascend in power, meaning each subsequent level incorporates the properties of the preceding ones plus additional ones.

  • The level of measurement determines which mathematical operations are permissible and, consequently, which statistical analyses can be validly applied.

Nominal Scale
  • Categorical data that can only be classified into mutually exclusive and exhaustive labels or categories. The numbers assigned serve purely as identifiers and have no quantitative meaning.

  • Only equality (=) or inequality (≠) operations are meaningful.

  • Examples: political party affiliation, DSM diagnostic codes (e.g., 303.00 for alcohol intoxication – the number is a label, not an amount), Yes/No survey responses, types of fruit, religious affiliation.

  • No inherent order or magnitude; performing arithmetic operations (like averages) on nominal codes is meaningless.

Ordinal Scale
  • Data that can be rank-ordered, meaning there is a meaningful sequence, but the distances or intervals between consecutive ranks are not necessarily equal or known. Relational operators (<, >) are meaningful.

  • No absolute zero point. The difference between rank 1 and 2 may not be the same as the difference between rank 3 and 4.

  • Examples: Likert-type items in surveys (e.g., "Never," "Sometimes," "Often," "Always"), ranking job applicants by preference, educational attainment levels (e.g., high school, bachelor's, master's, doctorate), results from competitive events (1st, 2nd, 3rd place).

  • Averaging ranks is invalid because the intervals are not constant; only non-parametric statistics appropriate for ranked data should be used.

Interval Scale
  • Data possesses all the properties of ordinal scales, but additionally, the distances or intervals between consecutive values are equal and meaningful. This allows for addition (+) and subtraction (−) operations.

  • Critically, an interval scale lacks a true or absolute zero point, meaning zero does not indicate the complete absence of the measured attribute. Consequently, multiplication (×) and division (÷) operations are invalid for interpreting ratios.

  • Examples: temperatures in Celsius or Fahrenheit, calendar years, musical notes on a piano scale (A, A#, B, etc.), directions on a hue wheel.

  • Psychological total scores, such as IQ scores or personality inventory composite scores, are typically treated as interval data for practical purposes, even though their underlying item responses are often ordinal. This assumption simplifies analysis but can be debated.

Ratio Scale
  • The highest level of measurement, possessing all the properties of interval scales plus a true or absolute zero point. A value of zero genuinely signifies the complete absence of the measured attribute.

  • This true zero allows for all arithmetic operations, including multiplication (×) and division (÷), making ratios meaningful.

  • Examples: number of siblings, money balance (0 representing no money), reaction time (0 seconds means no time elapsed), dynamometer grip strength (0 kg indicates no strength applied), height, weight, volume, frequency counts.

  • In theory, a score of 0 can exist for these variables, even if physically unattainable in some contexts (e.g., assembling a puzzle in 0 seconds implies instantaneous completion, which is impossible).
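  • A small sketch encoding which operations each NOIR level supports (the dictionary and function names are illustrative, not a standard API):

```python
# Each level inherits the operations of the one below it and adds more.
PERMISSIBLE_OPS = {
    "nominal":  {"=", "≠"},
    "ordinal":  {"=", "≠", "<", ">"},
    "interval": {"=", "≠", "<", ">", "+", "-"},
    "ratio":    {"=", "≠", "<", ">", "+", "-", "×", "÷"},
}

def can_use(level: str, op: str) -> bool:
    return op in PERMISSIBLE_OPS[level]

print(can_use("interval", "÷"))  # False: no true zero, so ratios are meaningless
print(can_use("ratio", "÷"))     # True: e.g., 10 kg is twice 5 kg
```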

Describing Data Distributions

  • A raw score is the individual, unmodified result obtained from a test or measurement, before any transformation or interpretation.

  • A distribution is an organized array of all obtained scores, illustrating their frequency and pattern across the range of possible values.

Frequency Distributions
  • Simple frequency distribution: lists every individual score obtained along with the number of times (frequency) each score occurred. Useful for small datasets or when preserving exact score information is paramount.

  • Grouped frequency distribution: organizes raw scores into "class intervals" or bins (e.g., 95–99, 90–94). Each interval has a specified width (e.g., a width of 5). This provides a more concise overview for large datasets, but there is a trade-off between detail and clarity, as individual score information within an interval is lost.

    • Cumulative frequency and cumulative percentage: can be added to grouped frequency distributions to show the number or percentage of scores falling below a given interval's upper limit. This is useful for percentile calculations.
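  • A short Python sketch building a grouped frequency distribution with cumulative frequency and cumulative percentage (interval width and scores are invented):

```python
from collections import Counter

scores = [97, 94, 92, 91, 88, 88, 85, 83, 82, 79, 77, 75, 72, 68, 65]
WIDTH = 5

# Map each score to the lower bound of its class interval (e.g., 90-94).
bins = Counter((s // WIDTH) * WIDTH for s in scores)

cumulative = 0
for lower in sorted(bins):                       # ascending intervals
    cumulative += bins[lower]
    pct = 100 * cumulative / len(scores)
    print(f"{lower}-{lower + WIDTH - 1}: f={bins[lower]}, cum f={cumulative}, cum %={pct:.0f}")
```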

Graphical Displays
  • Histogram: a bar-type graph used for interval or ratio data (continuous variables) where the bars are contiguous (touching), indicating the continuous nature of the variable. The X-axis typically represents scores or class intervals, and the Y-axis represents frequency.

  • Bar graph: similar to a histogram but used for nominal or ordinal data (categorical or discrete variables). The bars are non-contiguous (separated) to emphasize the discrete nature of the categories. The X-axis represents categories, and the Y-axis represents frequency.

  • Frequency polygon: a line graph created by plotting a point at the midpoint of each class interval (or score value) at its corresponding frequency and then connecting these points with straight lines. It is particularly useful for comparing two or more distributions.

  • Common shapes of distributions:

    • Normal distribution: bell-shaped, symmetric, with most scores clustered around the mean.

    • Bimodal distribution: has two distinct peaks, suggesting two separate clusters or subgroups within the data (e.g., test scores from two different teaching methods).

    • Skewed distributions: asymmetrical.

    • Positive skew (tail extending to the right): indicates that most scores are clustered at the lower end, with a few extremely high scores. Often suggests a test was too difficult.

    • Negative skew (tail extending to the left): indicates most scores are clustered at the higher end, with a few extremely low scores. Often suggests a test was too easy.

    • Uniform distribution: all scores have approximately the same frequency, forming a flat rectangle.

    • Exponential distribution: scores decrease rapidly as values increase, often seen in waiting times or decay processes.

  • Consumer beware: Axis scaling on graphical displays can significantly mislead interpretation. Truncating the Y-axis, disproportionate scaling, or inappropriate interval widths can exaggerate or minimize differences, as illustrated by textbook examples such as the “Charred House,” which show how altering the Y-axis range can make small changes look dramatic.
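  • An illustrative matplotlib sketch contrasting the two bar-type displays above (invented data; note the touching vs. separated bars):

```python
# Histogram (contiguous bars, continuous data) vs. bar graph (separated
# bars, categorical data). Requires matplotlib.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

test_scores = [62, 68, 71, 73, 75, 75, 78, 80, 81, 84, 85, 88, 92]
ax1.hist(test_scores, bins=range(60, 100, 5))   # bars touch: continuous scale
ax1.set(title="Histogram", xlabel="Score", ylabel="Frequency")

categories, counts = ["Yes", "No", "Unsure"], [14, 9, 4]
ax2.bar(categories, counts, width=0.5)          # gaps emphasize discreteness
ax2.set(title="Bar graph", xlabel="Response", ylabel="Frequency")

plt.tight_layout()
plt.show()
```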

Measures of Central Tendency

  • These statistics describe the typical or central value within a dataset.

  • Mean (\bar X): the arithmetic average, calculated as the sum of all scores (\sum X) divided by the number of scores (n). For grouped data, \bar X = \dfrac{\sum fX}{n}, where f is the frequency of each score/midpoint.

    • Sensitivity: highly sensitive to extreme scores or outliers, which can pull the mean away from the center of the majority of scores.

    • Best use: most appropriate for interval and ratio data that are approximately normally distributed.

  • Median: the middle score in an ordered distribution. If n is odd, it's the single middle score; if n is even, it's the average of the two middle scores.

    • Robustness: notably robust to outliers and skewed distributions, as it is only affected by the position of scores, not their exact values.

    • Best use: ideal for ordinal data or skewed interval/ratio data where the mean might be misleading.

  • Mode: the most frequently occurring value or score in a distribution.

    • Versatility: can be used with all levels of measurement, including nominal data, where it's the only appropriate measure of central tendency.

    • Characteristics: a distribution can have one mode (unimodal), two modes (bimodal), or more (multimodal). It is not affected by extreme values.
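  • A quick demonstration with Python's statistics module of how one outlier pulls the mean but barely moves the median and leaves the mode untouched (scores are invented):

```python
import statistics

scores = [70, 72, 75, 75, 78, 80, 82]
with_outlier = scores + [200]                   # one extreme high score

print(statistics.mean(scores), statistics.mean(with_outlier))      # 76.0 vs 91.5
print(statistics.median(scores), statistics.median(with_outlier))  # 75 vs 76.5
print(statistics.mode(with_outlier))                               # 75, unaffected
```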

Measures of Variability

  • These statistics quantify the spread or dispersion of scores within a dataset.

  • Range = X_{\text{max}} - X_{\text{min}}: the difference between the highest and lowest scores. It's quick to calculate but highly sensitive to extreme values and doesn't reflect the spread of scores between the extremes.

  • Quartiles (Q_1, Q_2, Q_3): divide an ordered distribution into four equal parts, each containing 25% of the scores.

    • Q_1 (first quartile): the 25th percentile.

    • Q_2 (second quartile): the 50th percentile, which is also the median.

    • Q_3 (third quartile): the 75th percentile.

    • Inter-quartile range (IQR) = Q_3 - Q_1: the range of the middle 50% of scores. It's a more stable measure of variability than the full range because it's unaffected by extreme outliers.

    • Semi-IQR = \dfrac{Q_3 - Q_1}{2}: half of the IQR, sometimes used as a measure of dispersion around the median.

  • Mean Absolute Deviation (MAD) = \dfrac{\sum |X - \bar X|}{n}: the average of the absolute differences between each score and the mean. It's less common in inferential statistics due to the mathematical difficulties of working with absolute values, but it provides a clear interpretation of average deviation.

  • Variance (s^2): the average of the squared deviations from the mean.

    • Formula (population form): \sigma^2 = \dfrac{\sum (X - \mu)^2}{N}

    • Formula (sample form): s^2 = \dfrac{\sum (X - \bar X)^2}{n-1} (using n-1 in the denominator, known as Bessel's correction, provides an unbiased estimate of the population variance when calculated from a sample; it accounts for the degree of freedom lost by estimating the mean, a correction that matters most for small samples).

    • It's a fundamental concept in statistics, used in many advanced analyses, but its units are squared, making it less intuitive for direct interpretation.

  • Standard Deviation (s): the positive square root of the variance (s = \sqrt{s^2}). It's the most widely used measure of variability.

    • Interpretation: represents the average distance or spread of scores from the mean in the original units of measurement. It is the cornerstone of normal distribution theory and inferential statistics.
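  • A sketch computing the variability measures above with the statistics module (the data are invented; note that quartile conventions vary between packages):

```python
import statistics

x = [4, 6, 7, 8, 9, 10, 12, 13, 15, 16]

rng = max(x) - min(x)                                   # range
q1, q2, q3 = statistics.quantiles(x, n=4)               # quartile cut points
mean = statistics.mean(x)
mad = sum(abs(v - mean) for v in x) / len(x)            # mean absolute deviation

print("range:", rng, "IQR:", q3 - q1, "MAD:", mad)
print("population variance:", statistics.pvariance(x))  # divides by N
print("sample variance:", statistics.variance(x))       # divides by n-1
print("sample SD:", statistics.stdev(x))
```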

Shape Indices
  • Skewness: quantifies the asymmetry of a distribution, indicating whether scores are clustered more towards one tail.

    • Positive skew (right-skewed): mean > median > mode. The tail extends to the right, suggesting a test was too difficult (most scores are low).

    • Negative skew (left-skewed): mean < median < mode. The tail extends to the left, suggesting a test was too easy (most scores are high).

    • Zero skew: perfectly symmetrical distribution, like the normal curve.

  • Kurtosis: measures the peakedness or flatness of a distribution relative to a normal distribution.

    • Leptokurtic: a sharply peaked distribution with heavy tails, indicating a high concentration of scores near the mean and more extreme outliers than a normal curve.

    • Platykurtic: a relatively flat distribution with light tails, indicating scores are more spread out and fewer extreme outliers than a normal curve.

    • Mesokurtic: a distribution with a kurtosis similar to that of a normal distribution (excess kurtosis = 0).
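  • A sketch computing shape indices with SciPy on synthetic data; note that scipy.stats.kurtosis reports excess kurtosis by default, so values near 0 indicate a mesokurtic shape:

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=1.0, size=10_000)  # long right tail
normal_like = rng.normal(size=10_000)

print("skewness (exponential):", stats.skew(right_skewed))      # clearly > 0
print("skewness (normal):", stats.skew(normal_like))            # near 0
print("excess kurtosis (normal):", stats.kurtosis(normal_like)) # near 0
```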

The Normal Curve

  • A theoretical, perfectly symmetrical, unimodal, bell-shaped distribution. Its mean, median, and mode are all located at the exact center.

  • It extends infinitely in both positive and negative directions (approaching, but never touching, the X-axis).

  • Empirical Rule (68-95-99.7 Rule): specific proportions of scores fall within certain standard deviation ranges from the mean:

    • Approximately 68.27% of scores fall within \pm 1 standard deviation (\sigma) of the mean.

    • Approximately 95.45% of scores fall within \pm 2\sigma of the mean.

    • Approximately 99.73% of scores fall within \pm 3\sigma of the mean.

  • Tails: the regions of the distribution beyond \pm 2\sigma (containing about 5% of scores) or \pm 3\sigma (containing less than 0.3% of scores) hold rare or extreme scores. These are particularly relevant in educational and clinical settings for identifying exceptional individuals (e.g., gifted, intellectually disabled, individuals with very high or low clinical symptom scores).
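  • The empirical rule proportions can be verified directly from the standard normal CDF, as in this short SciPy sketch:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)   # area within ±k sigma
    print(f"within ±{k}σ: {proportion:.4%}")
# within ±1σ: 68.2689%   within ±2σ: 95.4500%   within ±3σ: 99.7300%
```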

Standard Scores

  • Raw scores themselves are often difficult to interpret without context. Standard scores convert raw scores into a common scale with a predefined mean (\mu) and standard deviation (\sigma), allowing for meaningful comparisons across different tests or with normative data.

z Scores
  • The most fundamental standard score.

  • Formula: z = \dfrac{X - \bar X}{s} (where X is the raw score, \bar X is the mean of the distribution, and s is the standard deviation).

  • Characteristics: A z-score specifies how many standard deviations a raw score is above or below the mean.

    • Has a mean of 0 and a standard deviation of 1.

    • Directly allows for comparison of performance across tests measured on different scales and for determining the percentile rank of a score by referencing areas under the standard normal curve.

T Scores
  • A linear transformation of z-scores designed to eliminate negative values and decimals, yielding a more user-friendly scale.

  • Formula: T = 10z + 50

  • Characteristics: Has a mean of 50 and a standard deviation of 10.

    • Widely used in personality inventories (e.g., MMPI, PAI) and clinical assessments.

    • Values typically range from 20 to 80, covering \pm 3 standard deviations from the mean.

Stanines
  • Short for "standard nines," stanines are single-digit, integer-based standard scores.

  • Characteristics: Has a mean of 5 and an approximate standard deviation of 2.

    • Scores range from 1 to 9.

    • Each stanine (except 1 and 9) represents a width of approximately half a standard deviation.

    • The 5th stanine spans the middle 20% of the distribution (from -0.25\sigma to +0.25\sigma), centered on the mean. Stanines are frequently used in educational testing for broad interpretation.

Deviation IQ
  • Modern intelligence tests (e.g., Wechsler scales) use deviation IQ scores instead of ratio IQs to address issues with age-related scaling. They are standardized scores that indicate an individual's intellectual ability relative to their age group.

  • Characteristics: Typically set with a mean (\mu) of 100 and a standard deviation (\sigma) of 15.

    • This means about 95% of the population falls between IQ scores of 70 (2 SD below mean) and 130 (2 SD above mean).
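  • A minimal sketch converting one raw position (1.5 SD above the mean, on an invented raw scale with mean 100 and SD 10) into the standard scores described above; the stanine clipping is an illustrative convention:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    return (x - mean) / sd

z = z_score(115, mean=100, sd=10)   # illustrative raw score, mean, SD
t = 10 * z + 50                     # T score: mean 50, SD 10
iq = 15 * z + 100                   # deviation IQ: mean 100, SD 15
stanine = max(1, min(9, round(2 * z + 5)))  # mean 5, SD ~2, clipped to 1-9

print(z, t, iq, stanine)            # 1.5, 65.0, 122.5, 8
```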

Linear vs Non-Linear Transforms
  • Linear transformations: preserve the exact shape of the original distribution. If the raw score distribution is skewed, the linearly transformed standard score distribution (e.g., z-scores, T-scores, Deviation IQ) will also be skewed. They maintain relative distances between scores.

    • Examples: Converting raw scores to z-scores or then to T-scores, or the SAT's original scale (mean 500, SD 100).

  • Non-linear transformations (Normalization): are required when the raw score distribution is significantly skewed or non-normal and one wishes to force it into a normal shape (or some other desired distribution). This process, known as normalization, creates normalized standard scores.

    • This involves converting raw scores to percentiles, then finding the z-score associated with that percentile in a normal distribution. While this can make scores easier to interpret in relation to a normal curve, it distorts the original interval properties of the raw scores.
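  • A sketch of this normalization procedure (midpoint-based percentile ranks are one common convention, assumed here for illustration):

```python
# Non-linear normalization: raw scores -> percentile ranks -> the z scores
# holding those percentiles in a normal distribution. Uses SciPy.
import numpy as np
from scipy import stats

raw = np.array([2, 3, 3, 4, 5, 7, 9, 14, 20])   # invented, skewed raw scores

ranks = stats.rankdata(raw)                      # 1..n, ties averaged
percentiles = (ranks - 0.5) / len(raw)           # midpoint rule avoids 0 and 1
normalized_z = stats.norm.ppf(percentiles)       # force a normal shape

print(np.round(normalized_z, 2))
```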

Correlation & Inference

  • Correlation coefficient (r): a single numeric index that quantifies the degree and direction of a linear relationship between two continuous variables. Its value ranges from -1.0 to +1.0.

    • Sign: indicates the direction of the relationship.

    • Positive (+) sign: as one variable increases, the other tends to increase (e.g., study time and grades).

    • Negative (−) sign: as one variable increases, the other tends to decrease (e.g., absenteeism and grades).

    • Magnitude (absolute value of r): indicates the strength of the linear relationship.

    • Values close to \pm 1.0 (e.g., \pm 0.8 to \pm 1.0) suggest a strong relationship.

    • Values close to 0 (e.g., \pm 0.1 to \pm 0.3) suggest a weak or negligible relationship.

  • Pearson product-moment correlation coefficient (r): the most common correlation coefficient, suitable for two continuous variables that exhibit a linear relationship and are approximately normally distributed.

    • Assumptions: linearity (the relationship can be approximated by a straight line), normality of variables, and homoscedasticity (equal variance of residuals across all levels of the independent variable).

    • Formula (raw-score version, for computational efficiency): r = \dfrac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}

  • Coefficient of determination (r^2): calculated by squaring the Pearson correlation coefficient (r^2). It represents the proportion (or percentage, when multiplied by 100) of the variance in one variable that can be statistically explained or accounted for by the variance in the other variable.

    • Example: If r = 0.90, then r^2 = 0.81. This means 81% of the variance in Y can be explained by X (or vice-versa), leaving 19% unexplained variance.

  • Spearman's rho (\rho or r_s): a non-parametric correlation coefficient used when data are at least ordinal, or when dealing with small sample sizes (n < 30), or when assumptions for Pearson's r (like linearity or normality) are violated. It calculates the correlation between the ranks of the two variables, not their raw scores.
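  • A worked check of the raw-score Pearson formula above against SciPy, with Spearman's rho computed on the same invented data for comparison:

```python
import math
from scipy import stats

x = [2, 4, 5, 6, 8, 9]   # invented paired observations
y = [3, 5, 4, 7, 9, 10]
n = len(x)

# Raw-score Pearson formula, term by term.
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
                (n * sum(b * b for b in y) - sum(y) ** 2))
r = num / den

print(r, stats.pearsonr(x, y)[0])         # the two values should match
print("r^2 =", r ** 2)                    # proportion of shared variance
print("rho =", stats.spearmanr(x, y)[0])  # rank-based alternative
```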

Scatterplots
  • Graphical displays that plot observations of two variables for a set of data. The X-axis represents one variable, and the Y-axis represents the other.

  • Visual cues for interpreting relationships:

    • Strong positive correlation: points cluster tightly along an upward-sloping line from left to right.

    • Strong negative correlation: points cluster tightly along a downward-sloping line from left to right.

    • Weak correlation: points are widely scattered, showing no clear linear pattern.

    • No correlation: points are randomly scattered like a cloud.

    • Curved pattern: indicates a curvilinear relationship (e.g., inverted U-shape). In such cases, Pearson's r, which only measures linear relationships, would be inappropriate or misleadingly low.

    • Outliers: individual data points that deviate significantly from the general pattern of the other points. Outliers can heavily influence the correlation coefficient, potentially inflating or deflating it. They should be carefully investigated for data entry errors, measurement issues, or truly exceptional cases.

    • Restriction of range: occurs when the range of scores for one or both variables is artificially limited (e.g., only examining the correlation between SAT scores and college GPA for students admitted to an elite university, where only high SAT scores are present).

    • Effect: restriction of range almost always weakens (reduces) the observed correlation coefficient, making it appear weaker than the true correlation would be across the full range of data. This is a crucial consideration in interpreting research findings.
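  • A simulation sketch of restriction of range (all parameters invented): correlate two related variables over the full sample, then only over the top scorers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
aptitude = rng.normal(100, 15, size=5_000)
performance = 0.6 * aptitude + rng.normal(0, 12, size=5_000)

r_full = stats.pearsonr(aptitude, performance)[0]

admitted = aptitude > 115                     # keep only high scorers
r_restricted = stats.pearsonr(aptitude[admitted], performance[admitted])[0]

print(f"full range r = {r_full:.2f}, restricted r = {r_restricted:.2f}")
# the restricted correlation is noticeably weaker
```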

Meta-Analysis & Effect Size

  • Meta-analysis: a sophisticated statistical method that systematically combines the results of multiple independent studies addressing the same research question. Its goal is to synthesize findings and obtain a more precise and robust overall estimate of an effect.

  • Advantages:

    • Replicability: follows a predefined protocol for study selection and analysis.

    • Weighted by sample size: studies with larger, more precise samples contribute more to the overall estimate.

    • Focus on magnitude: emphasizes the effect size (the strength or magnitude of a phenomenon) rather than just statistical significance (p-values), providing a more practical interpretation of findings.

    • Evidence-based practice: serves as a cornerstone for evidence-based decision-making in various fields, providing summaries of the best available research evidence.

    • Increased statistical power: combining studies increases the total sample size, making it easier to detect true effects.

  • Effect size: a standardized measure of the magnitude of an observed effect or relationship, independent of sample size. It quantifies the practical significance of a finding.

    • Often expressed as a correlation coefficient (r) for associations or a standardized mean difference (Cohen's d) for group comparisons.

  • Quality depends on methodological rigor: the validity of a meta-analysis relies heavily on the quality of the included studies and the meticulousness of the meta-analytic protocol (e.g., clear study inclusion/exclusion criteria, thorough search strategies, methods for assessing and mitigating publication bias or heterogeneity between studies).
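  • A minimal sketch of one common combination rule (fixed-effect, inverse-variance weighting); the effect sizes and variances below are invented:

```python
# Each study's effect size is weighted by the inverse of its variance,
# so larger, more precise studies count more toward the pooled estimate.
studies = [  # (effect size d, variance of d)
    (0.30, 0.040),
    (0.45, 0.010),   # biggest study -> smallest variance -> most weight
    (0.20, 0.080),
]

weights = [1 / var for _, var in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)

print(f"pooled effect size d = {pooled:.3f}")   # 0.400
```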

Key Formulas at a Glance

  • \bar X = \dfrac{\sum X}{n} \quad \text{Mean}

  • \sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}} \quad \text{SD (population form)};\; s = \sqrt{\dfrac{\sum (X - \bar X)^2}{n-1}} \quad \text{(sample form)}

  • MAD = \dfrac{\sum |X - \bar X|}{n}

  • IQR = Q_3 - Q_1

  • z = \dfrac{X - \bar X}{s};\; T = 10z + 50

  • r = \dfrac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}};\; r^2 = r \cdot r

Practical & Ethical Considerations

  • Match scale level to statistical operations: Always ensure that the statistical operations performed are appropriate for the level of measurement of the data. For example, avoid averaging ranks (ordinal data) or calculating ratios with interval data, as these operations violate the inherent properties of those scales and can lead to meaningless or incorrect conclusions.

  • Report score precision consistent with instrument accuracy: Do not report more decimal places than the precision of the measuring instrument allows. Over-precision implies a level of accuracy that does not exist and can be misleading.

  • Beware misleading graphs (axis truncation, improper scaling): Always critically evaluate graphical representations of data. Manipulated axes or scales can visually distort trends and relationships, leading to biased interpretations.

  • Account for error and context when interpreting outliers or extremes: Outliers should not be indiscriminately removed without investigation. Understand potential sources of measurement error and contextual factors that might explain extreme scores before making decisions about their inclusion or exclusion.

  • Meta-analytic conclusions should guide but not replace professional judgment: While meta-analyses provide powerful evidence syntheses, they generalize findings. Individual clinical or educational decisions must always integrate meta-analytic evidence with specific client characteristics, available resources, and professional expertise.

Self-Check: Can You…?

  • Distinguish interval vs ratio with examples, explaining the critical difference of a true zero?

  • State the exact proportions of scores that fall within \pm 1\sigma, \pm 2\sigma, and \pm 3\sigma under the normal curve?

  • Convert a score that is 1.5\sigma above the mean into its equivalent z-score, T-score, and Deviation IQ score (assuming standard parameters)?

  • Explain comprehensively why restriction of range typically lowers the observed correlation coefficient?

  • Identify the specific situations or data characteristics that would necessitate using Spearman's rho instead of Pearson's r for correlation?
