Measurement, Distributions, and Percentiles – Study Notes

Measurement, Distributions, and Percentiles – Comprehensive Study Notes

Acknowledgement and context
- Opening and scope: this week covers measurement, frequency distributions, and percentiles; gradual introduction to numbers.
- Mid-semester exam scope: weeks 1–4 content; practice materials and quizzes recommended; exam date announced on Blackboard (Saturday, September 6).
- Relevance across degree: data cleaning, exploration, and analysis are common tasks in assignments; honors year in psychology involves a full year of study design, data collection, analysis, and thesis writing – these topics are foundational for that workflow.
Big-picture progression of a study in psychology
- Three stages: design a study, run the study, then analyze the numbers you collect.
- First data-processing steps: create plots, explore data, clean data.
- Throughout a degree, you’ll repeatedly clean, explore, and analyze data; in honors, you’ll perform this across a year end-to-end.
Core topics of the lecture
- First half: measurement of psychological constructs, reliability, sensitivity, and related concepts.
- Second half: data presentation and storytelling with figures; plotting decisions that tell a clear story.
Measurement and empirical foundations
- Constructs vs. observable phenomena: psychological constructs like anxiety or memory are not directly observable; operational definitions are needed to bound what counts as a measure of the construct.
- Operational definition example: imitation in infants (becoming the stimulus) with a tongue-protrusion paradigm.
- Coding scheme example (to operationalize imitation):
  - 0 = no response
  - 1 = partial response (e.g., some tongue movement but not clearly imitative)
  - 2 = full response (clear, unambiguous tongue protrusion)
- Researchers often train coders, use multiple trials, and rely on agreed-upon criteria to improve reliability and validity of these judgments.
- Empiricism and objectivity: measurement should capture observable phenomena that can be checked and verified by others; openness and replication are healthy for scientific progress.
Variables and measurement scales (types and implications)
- Variable: a characteristic of interest for each individual in a population or sample (e.g., memory capacity, distraction condition).
- Qualitative (categorical) vs. quantitative (numerical) attributes:
- Qualitative: categories without intrinsic numeric magnitude (e.g., gender, eye color, political affiliation).
- Quantitative: numeric values with meaningful magnitude (e.g., height, weight, income).
- Measurement is about assigning numbers to observations according to consistent rules (operational definitions).
- Qualitative variables can be coded numerically (e.g., eye color: 0–blue, 1–brown, etc.), but not all numerical operations are meaningful on qualitative data (e.g., averaging eye color codes).
- Quantitative scales and ordering:
- Discrete vs. Continuous: discrete has whole numbers (e.g., number of cars passing by); continuous can take any value within a range (e.g., height).
- Dichotomous: a special discrete case with only two values (e.g., alive/dead, true/false).
- Scales of measurement (from simplest to most informative):
- Nominal: categories with no intrinsic order (e.g., eye color, political party labels). No meaningful magnitude, equal intervals, or true zero.
  - Example: color labels (Yellow=2, Green=4, etc.) are labels; the numbers are identifiers, not magnitudes.
- Ordinal: order matters, but intervals between values are not necessarily equal (e.g., race placement, level of preference).
  - Example: ranking Smarties by preference: red=1, blue=2, green=3, etc. Order matters, but gaps are not quantified.
- Interval: order and meaningful equal intervals, but no true zero (e.g., IQ scores, temperature in Celsius).
  - Distances between values are interpretable, but 0°C does not mean 'no temperature.'
- Ratio: order, meaningful equal intervals, plus a meaningful zero that allows ratio comparisons (e.g., height, weight, Kelvin temperature, age).
  - With a true zero, statements like 'twice as tall' are meaningful.
- The choice of scale affects allowable statistics and the kinds of claims you can make.
- Measurement of constructs in psychology requires careful consideration of scale properties and the interpretation of results.
Reliability and validity: core psychometrics concepts
- Reliability: stability and consistency of a measure across time, raters, or trials.
- Test-retest reliability: administer the same test twice; scores should be similarly related if the underlying trait is stable.
  - Represented visually by a scatter plot of Test 1 vs Test 2 scores; a strong positive correlation indicates reliability.
  - Realistically, perfect identical scores are unlikely due to day-to-day variation (sleep, mood, etc.).
- Inter-rater reliability: agreement between two or more raters who assess the same data; assessed by correlation between their scores.
  - Acceptable reliability is often around r ≈ 0.60 or higher; higher is better.
- Validity: the extent to which a measure captures what it is intended to measure.
- Internal validity: the extent to which observed effects are due to the manipulation rather than confounds; lack of control for confounds reduces internal validity.
- External validity: generalizability of findings beyond the study sample or setting (e.g., WEIRD samples: Western, Educated, Industrialized, Rich, Democratic).
  - Low external validity means limited generalizability to other populations or cultures.
- Construct validity: how well a test or measure actually captures the theoretical construct of interest.
  - Example: Beck Depression Inventory (BDI) faced questions of whether some items truly map onto depression vs. anxiety; concerns about construct validity if items overlap with anxiety constructs.
- Content/Face validity: the intuitive apparent fit of a measure to the construct; what it seems to measure on the surface.
  - Example: a depression measurement that asks about temperature would likely have low face validity despite potential statistical reliability.
- Predictive validity: extent to which a measure predicts outcomes it should predict (e.g., ATAR predicting university performance).
- Range effects (floor and ceiling effects): a measure too easy or too hard can fail to discriminate among participants.
- Ceiling effect: most participants perform at the top end, limiting ability to detect differences.
- Floor effect: most participants perform at the bottom end.
- Pilot testing helps calibrate measures to avoid these effects, ensuring sensitivity to differences.
Measurement design considerations and pilot testing
- Pilot testing: iterative testing of the design and stimuli to ensure the task yields usable, discriminating data; helps identify floor/ceiling effects and timing or presentation issues.
- The role of pilot testing in avoiding wasted data collection time and ensuring the stimulus yields a useful range of responses.
- Ethical and practical implications: robust measurement improves scientific validity and the efficiency of research; poor measurement wastes resources and could mislead interpretations.
Designing studies and addressing variability
- Study types and randomization: experimental studies, randomized controlled trials, observational studies, quasi-experiments, and correlational designs; randomization helps control for confounds.
- Confounding variables: factors that co-occur with the IV and can threaten the interpretation of results; strategies include control groups/conditions and counterbalancing.
- Independent groups design vs. repeated measures design:
- Independent groups: different participants in each condition; straightforward but may require more participants.
- Repeated measures: same participants across conditions; more powerful but susceptible to carryover and order effects; counterbalancing mitigates confounds.
Data organization, exploration, and visualization (the second half of the lecture)
- Purpose of displaying data: to tell a story, reveal patterns, detect errors, and support interpretation beyond text.
- Data quality reality: psychology data are often messy due to human factors; data exploration helps identify anomalies, missing values, and transcription errors.
- Data cleaning: removing or correcting erroneous data, filtering noise, handling missing values, and preparing data for analysis.
- From raw matrices to interpretable summaries: moving from a matrix of 100 students × 10 questions to interpretable summaries such as distributions and summaries.
Frequency distributions and data display options
- Frequency table: tallies the number of observations per score or category; useful for qualitative data and small ranges.
- Relative frequency: the proportion of observations in each category, computed as $ext{relative frequency} = \frac{ ext{frequency}}{N}$ where $N$ is the total sample size.
- Cumulative frequency: the total number of observations up to and including a given category; used to compute percentiles.
- Intervals (bins) for continuous data: group observations into non-overlapping bins (e.g., 50–54, 55–59, etc.). Practical guidance: aim for around 10–20 bins; avoid overlaps; choose bins to enable proper polygon plotting and to support meaningful interpretation.
- Why start bins with an underflow bin (e.g., 45–49) even if empty: to ensure the frequency polygon can start at zero and hit the x-axis cleanly.
- Frequency polygon: a line plot connecting bin midpoints with heights corresponding to frequencies; useful for visualizing distributions, especially when comparing multiple groups.
- Bar graphs: good for qualitative (nominal) data; bars should not touch to reflect discrete categories.
- Histograms: bar plots with touching bars; appropriate for continuous or binned data to reflect the continuity of the scale.
- Box-and-whisker plots: convey median, interquartile range (IQR), and extremes; useful for showing central tendency and dispersion in one figure; box spans the central 50% of data (IQR); median shown inside the box; whiskers extend to the min and max or to some percentile bounds.
- Frequency histograms vs. frequency polygons vs. box plots: each has strengths for different data types and storytelling goals; choice depends on the data and the story you want to tell.
- Example storytelling with plots: male vs. female weights, actual vs. ideal weights; using frequency polygons to compare distributions and dot plots to show cross-group comparisons.
Percentiles and percentile calculations (core quantitative concept)
- Percentile: the value below which a specified percentage of scores fall; percentile rank is the proportion of scores at or below a given value.
- Fundamental formula:
- Percentile rank of a score: $P = \frac{CF}{N} imes 100$ where $CF$ is the cumulative frequency up to that score, and $N$ is the total number of scores.
- Inverse calculation (finding the score at a given percentile):
- Cumulative frequency target: $CF = \frac{P}{100} imes N$ . Then locate the smallest score whose cumulative frequency is at least $CF$ .
- Practical example from the transcript:
- Suppose a distribution with total $N=20$ and a score of 23 has a cumulative frequency of 7. The percentile would be:
  - P = rac{CF}{N} imes 100 = rac{7}{20} imes 100 = 35 ext{ }rac{ ext{percent}}{}
  - ext{So a score of 23 is in the 35th percentile.}
- To find the score at the 85th percentile for the same data:
  - Target CF = rac{85}{100} imes 20 = 17.
  - Look for the score with cumulative frequency 17; the example in the transcript found that to be a score of 25, so you’d need a score of 25 or higher to beat at least 85% of the class.
- Relative vs. cumulative frequency recap:
- Relative frequency: frequency/N.
- Cumulative frequency: sum of frequencies up to and including a given score.
- Example with larger data: TV-watching hours (259 students) – calculating percentile for 7 hours from a grouped distribution and summarizing with a frequency polygon to visualize distribution around the 63rd percentile.
- Practical interpretation: percentile ranks convey how an individual compares to the distribution (e.g., “in the 35th percentile” means better than 35% of the group).
Illustrative data examples used in the lecture
- Imitation operational definition example (in infants): demonstrated coding challenges and inter-rater reliability concerns when judging whether an infant imitates tongue protrusion.
- Weight data example (72 male students): discussion of wide weight range, use of bins (e.g., 60–64 kg), and how to interpret a 65–69 kg peak.
- TV-watching hours example (259 students): determination of a typical amount and identification of an extreme outlier (e.g., 40 hours/week).
- Male vs. female weight comparisons using frequency polygons and ideal weights to illustrate storytelling with plots.
Practical implications for data analysis and reporting
- Choose graphs that tell the story clearly and faithfully; the reader should grasp the message at a glance.
- Use appropriate data displays for different data types:
- Qualitative data: bar graphs (nominal categories, non-touching bars to emphasize discreteness).
- Quantitative data: histograms, frequency polygons, box plots; consider 10–20 bins for histograms.
- Data quality and preparation: removing or correcting errors, identifying outliers, and ensuring the data meet the assumptions of planned analyses.
- Inferential testing readiness: well-plotted data facilitate checking assumptions (normality, homogeneity of variance) and improve interpretability of statistical tests.
- Reporting and publication: visuals should support the written narrative and help convey the study’s claims without excessive text.
Links to upcoming and related content
- Next lecture focus: central tendency (mean, median, mode) and variability (how scores move around the center).
- Mathematical prerequisites for upcoming topics: basic calculator skills (add/subtract/multiply/divide, square, square root).
- Symbols and notation to know:
- Sigma for summation:
  \sum x\,
- Inequalities and their counterparts (>, <, ≥, ≤).
- Positive and negative values: +/− signs.
- Readings and practice materials:
- Aaron textbook, Chapter 1; UQ Extend Module 4.
- For next week: Aaron Chapter 2; UQ Extend Module 5.
- Assessment:
- Quiz for the week opens in 1 hour and closes Monday.
Ethical, philosophical, and practical implications raised
- Open science and construct validity: the need for robust constructs and transparent operational definitions to enable replication and critique.
- External validity concerns: most psychology research uses WEIRD populations; explicit caution about generalizability to diverse cultures and settings.
- The healthy scientific process includes debate over operational definitions and ongoing refinement; disagreements drive methodological improvements and consensus over time.
Quick reference formulas and concepts (summary)
- Percentile rank: P = rac{CF}{N} imes 100
- Inverse percentile (finding score at percentile P): CF = rac{P}{100} imes N$$
- Box-and-whisker plot components: median, interquartile range (IQR), whiskers (min/max or defined bounds).
- Reliability types: test-retest (consistency over time), inter-rater (consistency across raters).
- Validity types: internal, external, construct, content/face, predictive.
- Data-display choices: nominal data → bar graphs with gaps; continuous data → histograms or frequency polygons; distributions → consider 10–20 bins; outliers identified via plots.
- Range effects: ceiling/floor effects; pilot testing to optimize measurement sensitivity.
Final reminders for exam preparation
- Practice building and interpreting frequency tables, histograms, and frequency polygons.
- Be comfortable with percentiles, cumulative frequencies, and translating percentile ranks into actionable interpretation.
- Understand the relationship between reliability, validity, and the conclusions you can draw from data.
- Review the next set of topics (central tendency and variability) and ensure you can perform basic statistical operations with a calculator.
Notes on exam readiness
- Focus on being able to explain why we choose certain scales and plots for different data types.
- Be able to articulate the implications of floor/ceiling effects and how pilot testing mitigates them.
- Be able to discuss external validity concerns in the context of WEIRD samples and cross-cultural generalizability.
References to course materials mentioned in the lecture
- Aaron textbook, Chapter 1 (and Chapter 2 for the next session)
- UQ Extend Module 4 (and Module 5 for next session)
Summary takeaway
- Measuring psychological constructs requires careful operational definitions and awareness of scale properties.
- Reliability and validity determine whether our measures can support credible conclusions.
- Organizing and displaying data thoughtfully helps tell the right story and supports valid inferences for statistical testing.