Z-scores, percentiles, data cleaning, and Excel workflow - Study Notes

Z-scores, percentiles, and tails in a normal distribution

  • In a standard normal distribution, probabilities are read from Z-tables, which typically show:

    • the percentage of data between the mean and a given z-score (first column),

    • the percentage of data between that z-score and either tail (second/third column depending on table layout; the lecture refers to a “tail” column).

  • Key idea: because the normal distribution is symmetric around the mean, the sign of z only shifts the side of the distribution; the absolute value is what matters for the tail/inside-mean areas.

  • Example interpretations from the transcript:

    • A z-score of 0.5: about 19.15% of the data lies between the mean and that z-score; about 30.85% lies between that z-score and positive or negative infinity (the tail area).

    • A z-score of 1: the tail area (on one side) is 15.87%.

    • Because the distribution is symmetric, a z-score of −1 has the same tail percentage as +1; you use the absolute value to locate the corresponding tail.

    • An IQ score of 85 converts to a z-score of z = (85 − 100)/15 = −1, so about 15.87% of people have an IQ ≤ 85.
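
    The table lookups above can be checked in code. This sketch uses Python's standard-library `statistics.NormalDist` (an assumption for illustration — the lecture itself uses printed Z-tables) to reproduce the same areas:

    ```python
    from statistics import NormalDist

    std = NormalDist()  # standard normal: mean 0, sd 1

    # Area between the mean and z = 0.5 (the "mean-to-z" table column)
    mean_to_z = std.cdf(0.5) - 0.5      # ~0.1915
    # Tail area beyond z = 0.5
    tail = 1 - std.cdf(0.5)             # ~0.3085
    # Symmetry: the lower tail at z = -1 equals the upper tail at z = +1
    lower_tail = std.cdf(-1)            # ~0.1587

    # IQ example: X = 85 with mu = 100, sigma = 15 gives z = -1
    z = (85 - 100) / 15
    print(round(mean_to_z, 4), round(tail, 4), round(lower_tail, 4), z)
    ```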

  • Formulas you should know:

    • Converting between raw scores and z-scores: z = (X − μ)/σ and, equivalently, X = μ + zσ

    • If you want the score that places someone in the top 1% (upper tail): find the z corresponding to a tail probability of 0.01. In the vignette, the chosen z is z = 2.32 (based on a tail probability of about 0.0102 in that table, i.e., ~1.02% in the tail). Then convert back to a raw score using X = μ + zσ.

  • Worked example: top-1% raw IQ threshold

    • Given a normal IQ distribution with μ = 100, σ = 15, and z ≈ 2.32 for the top-1% tail,

    • Raw score threshold: X = 100 + 2.32 × 15 = 134.8.

    • Conclusion: to be in the top 1%, you need a score of about 134.8 or higher.
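
    As a cross-check on this worked example, the same threshold can be computed directly. Here `statistics.NormalDist` stands in for the printed table (an assumption, not lecture material), so its exact z (≈2.326) differs slightly from the table's 2.32:

    ```python
    from statistics import NormalDist

    iq = NormalDist(mu=100, sigma=15)

    # Exact z for the 99th percentile; the lecture's table gives z ~ 2.32
    z_top1 = NormalDist().inv_cdf(0.99)     # ~2.326

    # Raw-score threshold via X = mu + z*sigma, using the table's z = 2.32
    threshold_table = 100 + 2.32 * 15       # 134.8, as in the notes
    # Or directly from the distribution's inverse CDF
    threshold_exact = iq.inv_cdf(0.99)      # ~134.9
    print(round(z_top1, 3), threshold_table, round(threshold_exact, 1))
    ```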

  • Another example: percentage of people with IQ between 109 and 127

    • Mean μ = 100, σ = 15.

    • Z-scores: for 109, z = (109 − 100)/15 = 0.6; for 127, z = (127 − 100)/15 = 1.8.

    • Percentages from the table:

    • Percentage between the mean and 109: 22.57%.

    • Percentage between the mean and 127: 46.41%.

    • Therefore, the percentage between 109 and 127 is 46.41% − 22.57% = 23.84%.

  • Another range example: percentage between 84 and 110

    • Z-scores: lower bound 84 → z ≈ −1.07; upper bound 110 → z ≈ 0.67.

    • Percentages from the table:

    • Mean to z = −1.07: 35.77%.

    • Mean to z = 0.67: 24.86%.

    • The red middle area (between 84 and 110) is the sum of these two mean-to-z pieces: 35.77% + 24.86% = 60.63%.

    • Therefore, about 60.63% of people have an IQ between 84 and 110.

  • Summary of practical approach:

    • To find the percentage in a range [a, b], compute the Z-scores for the boundaries, look up the percentage of data between the mean and each boundary, and combine appropriately: if both boundaries lie on the same side of the mean, subtract the smaller area from the larger (as in the 109–127 example); if they straddle the mean, add the two areas (as in the 84–110 example).
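
The boundary-combination rule can be wrapped in a small helper. This is an illustrative sketch (not lecture material), and the continuous CDF values differ slightly from two-decimal table lookups:

```python
from statistics import NormalDist

def pct_between(a, b, mu=100, sigma=15):
    """Percentage of a normal population with scores in [a, b]."""
    dist = NormalDist(mu, sigma)
    return 100 * (dist.cdf(b) - dist.cdf(a))

# Both bounds above the mean: table areas subtract (109 to 127)
print(round(pct_between(109, 127), 2))   # close to the table's 23.84
# Bounds straddling the mean: table areas add (84 to 110)
print(round(pct_between(84, 110), 2))    # close to the table's 60.63
```

The CDF subtraction handles both cases automatically, which is why there is no explicit add-vs-subtract branch in the code.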


Data cleaning and screening: identifying impossible scores and ranges

  • Before describing or analyzing a dataset, you should:

    • become familiar with the data,

    • understand scales of measurement,

    • check for impossible scores (values that cannot occur given the variable).

  • Example of impossible scores:

    • Age: a value like 180 is not plausible for a human participant; such entries should be flagged as impossible.

    • Mistakes can arise from data entry errors or miscalibration during data collection.

  • Determine the possible range for each variable (to know what is plausible):

    • Agreeableness: measured on 10 items, each rated 1–5, so the total possible range is [10, 50].

    • Fear of teachers: 5 items, each rated 1–5, so the total range is [5, 25].

    • If a score is outside these ranges (e.g., 57 for agreeableness), that is an impossible score for that variable.

  • How to handle impossible scores when computing statistics:

    • Descriptive statistics (mean, SD): remove impossible scores independently for each variable before calculation. This yields correct means and SDs for each variable.

    • If you include impossible scores, the mean will be biased (pulled up by high outliers or pulled down by low outliers) and the standard deviation will increase due to added variability.

    • Correlation analyses: when correlating two variables, remove a participant’s scores on both variables if either one is missing or impossible. Only pairwise-complete data are used for each correlation.

    • Do not remove a participant’s data from one variable unless you are comfortable removing their data for the other variable involved in the correlation as well.

    • If a dataset contains extra variables not used in a particular analysis, you do not need to remove their observations unless those variables are involved in the analysis.

  • Practical example in a multi-variable assignment:

    • You will perform two correlations (involving four variables total).

    • If a participant has an impossible score on Agreeableness, also remove their Fear of teachers score if those two variables are being correlated.

    • You do not need to remove the participant’s data from every other variable unless those variables are included in the correlation you’re performing.

  • Quick takeaway: clean separately for descriptive stats, then clean by pair for each correlation; data cleaning is a two-step process.
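
The two-step cleaning idea can be sketched in Python with a hypothetical mini-dataset (the values below are invented for illustration, not from the assignment):

```python
# Hypothetical mini-dataset; None marks a removed/blank score
agree = [32, 45, 57, 28, 41]   # valid range [10, 50] -> 57 is impossible
fear  = [12, 25, 14, 30, 18]   # valid range [5, 25]  -> 30 is impossible

def clean(scores, lo, hi):
    """Replace out-of-range values with None, keeping row alignment."""
    return [x if (x is not None and lo <= x <= hi) else None for x in scores]

agree_c = clean(agree, 10, 50)
fear_c  = clean(fear, 5, 25)

# Step 1: descriptive stats use each variable's own valid scores
agree_valid = [x for x in agree_c if x is not None]
mean_agree = sum(agree_valid) / len(agree_valid)

# Step 2: correlations use only pairwise-complete cases
pairs = [(a, f) for a, f in zip(agree_c, fear_c)
         if a is not None and f is not None]
print(mean_agree, len(agree_valid), len(pairs))  # 36.5 4 3
```

Note that the mean uses 4 observations but the correlation would use only 3, because each step applies its own exclusion rule.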


Excel-based data cleaning and data analysis workflow (Week 7 exercise)

  • Context: a psychologist collected data from 30 children of refugees. Variables:

    • Subject ID (just an identifier)

    • Proximity score: range 0–60 (days since family left village/town/c city)

    • PTSD level: range 10–100 (summed trauma impact)

    • Sleep quality: range 9–81 (average sleep quality from interviews/polysomnography)

  • Data quality issues observed:

    • Miscalibration of questionnaire computers on some days

    • Parental misunderstanding leading to impossible values

  • Data cleaning workflow in the Excel workbook (Week 7, Excel Exercise 3):

    • Keep a raw data copy on a separate sheet/file for reference.

    • Use conditional formatting to highlight impossible scores (visually identify out-of-range values).

    • Apply conditional formatting separately for each variable because each has its own valid range:

    • Proximity score: 0–60 (highlight out-of-range values, e.g., blue).

    • PTSD level: 10–100 (highlight out-of-range values; four examples flagged in the transcript).

    • Sleep quality: 9–81 (highlight out-of-range values).

    • Also add a rule to highlight blank cells differently (e.g., green) so you can distinguish blanks from impossible values.

    • When using a Mac vs Windows, note minor UI differences for setting up conditional formatting rules (the logic remains the same).
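
A rough Python analogue of these conditional-formatting rules (the rows below are made up for illustration) shows the same three-way classification of ok, impossible, and blank:

```python
# Valid ranges per variable, as given in the exercise
RANGES = {"proximity": (0, 60), "ptsd": (10, 100), "sleep": (9, 81)}

rows = [
    {"proximity": 12, "ptsd": 45,  "sleep": 30},
    {"proximity": 75, "ptsd": 110, "sleep": None},  # two impossible, one blank
]

def flag(row):
    """Classify each cell the way the workbook's formatting rules do."""
    out = {}
    for var, (lo, hi) in RANGES.items():
        v = row[var]
        if v is None:
            out[var] = "blank"        # the separate blank-cell rule (green)
        elif not (lo <= v <= hi):
            out[var] = "impossible"   # the out-of-range rule
        else:
            out[var] = "ok"
    return out

print([flag(r) for r in rows])
```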

  • Visual inspection: build a scatter plot to explore relationships between proximity score and PTSD level

    • Copy the conditionally formatted data to a new sheet so the formatting is preserved for quick visual checking.

    • Create a scatter plot (Insert -> Charts -> Scatter) using proximity score (x-axis) and PTSD level (y-axis).

    • Add axis titles and a chart title (e.g., Relationship between proximity and PTSD).

    • Color code the dots: leave the “real” data in blue and manually color the impossible scores in orange for clear visibility.

    • This helps identify outliers and verify there are plausible relationships between the variables.

  • Preparing descriptive statistics: move to the Functions (or similar) worksheet

    • Copy the conditionally formatted data back to the Functions sheet.

    • Remove the impossible scores from the dataset used for statistics (delete those data points so they don’t affect the descriptive stats).

    • Remove conditional formatting rules except for the rule that marks blanks, so data calculations are not biased by formatting artifacts.

    • Compute means for each variable (Proximity, PTSD, Sleep):

    • Proximity mean: x̄ = (Σxᵢ)/n, or in Excel: =AVERAGE(range)

    • PTSD mean: computed the same way over the PTSD column

    • Sleep mean: computed the same way over the sleep column

    • Compute deviations and squared deviations (the instructor does this in a single combined workflow):

    • For each observation, compute the deviation: dᵢ = xᵢ − x̄

    • Square the deviations: dᵢ² = (xᵢ − x̄)²

    • Use absolute referencing for the mean when filling down the column so copy operations keep referencing the correct mean cell.

    • Remove deviation values corresponding to impossible scores so you only have deviations for valid observations.

    • Compute sum of squares for each variable:

    • SS = Σdᵢ² over all valid observations.

    • Determine the number of observations (n) for each variable (exclude impossible scores):

    • n_prox = COUNT of valid proximity scores, in Excel: =COUNT(range)

    • n_PTSD = COUNT of valid PTSD scores

    • n_sleep = COUNT of valid sleep scores

    • Compute variance and standard deviation:

    • Variance: σ² = SS / n

    • Standard deviation: σ = √(σ²)

    • In the example values from the transcript, the counts were:

    • Proximity: 28 observations

    • PTSD: 26 observations

    • Sleep: 27 observations

    • Record the resulting descriptive statistics for use in subsequent analyses.
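
The worksheet chain from mean through standard deviation can be mirrored in Python; the scores below are invented for illustration, with the lecture's Excel formulas noted in comments:

```python
# Replicating the worksheet chain for one variable (hypothetical scores,
# already cleaned of impossible values)
scores = [20, 35, 50, 15, 40]

n = len(scores)                        # Excel: =COUNT(range)
mean = sum(scores) / n                 # Excel: =AVERAGE(range)
devs = [x - mean for x in scores]      # deviation column, d_i = x_i - mean
sq_devs = [d ** 2 for d in devs]       # squared-deviation column
ss = sum(sq_devs)                      # sum of squares, SS
variance = ss / n                      # population variance, SS / n
sd = variance ** 0.5                   # standard deviation
print(n, mean, ss, variance, round(sd, 3))  # 5 32.0 830.0 166.0 12.884
```

As a sanity check, the deviation column should sum to (approximately) zero before squaring; if it does not, the mean cell reference is probably wrong.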

  • Extension and practice materials:

    • An extension worksheet contains the same data for 30 extra subjects to practice descriptive statistics calculations.

    • There is a separate worksheet for hand-calculated z-score practice with Z-tables provided (downloadable from Blackboard).

    • The instructor notes that final exam questions will be very similar to these practice worksheets, so completing them is advised.

  • Correlation prep and data handling guidance (context for the assignment):

    • You will perform two correlations using four variables total.

    • When calculating correlations, you must ensure that each pair of variables is complete for the observations you include (i.e., you must have data for both variables in the pair for a given participant).

    • If a participant has an impossible score for one variable involved in a correlation, you should exclude that participant’s data for that correlation by removing both variables’ values for that observation (i.e., you keep other variables intact, but the pair used for the correlation should be complete).

  • Final notes for exam readiness:

    • Practice identifying and interpreting z-scores, and reading z-tables for mean-to-z and z-to-tail percentages.

    • Be comfortable converting between raw scores and z-scores using X = μ + zσ and z = (X − μ)/σ.

    • Be able to compute a raw score corresponding to a high percentile (e.g., top 1%) using the appropriate z-value and the distribution parameters.

    • Understand the data-cleaning workflow: identifying impossible values, determining plausible ranges, and applying a two-step cleaning approach for descriptive statistics vs. correlation analyses.

    • Be able to outline a practical Excel workflow for cleaning data, creating scatter plots, and computing descriptive statistics (means, deviations, sum of squares, variance, and standard deviation).

  • Quick refresher on IQR (interquartile range) concept (from the related discussion):

    • IQR = Q3 − Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile.

    • When using Z-tables to approximate these quartiles for a normally distributed variable, you look for the z-values that correspond to 25% in the tails on either end (i.e., z ≈ ±0.674 for a standard normal).

    • The lecture notes suggest using z ≈ ±0.68 as a practical approximation, giving an IQR-related span of about 2 × 0.68σ ≈ 1.36σ when you know the standard deviation. (The exact IQR depends on the data’s empirical quartiles, but this illustrates the idea of using symmetric tails.)
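
The quartile approximation can be checked numerically; `statistics.NormalDist` (an assumption for illustration, not used in the lecture) gives the quartiles of a Normal(100, 15) directly:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

q1 = iq.inv_cdf(0.25)          # 25th percentile, z ~ -0.674
q3 = iq.inv_cdf(0.75)          # 75th percentile, z ~ +0.674
iqr = q3 - q1                  # ~ 2 * 0.674 * sigma ~ 1.349 * sigma
print(round(q1, 1), round(q3, 1), round(iqr, 1))  # 89.9 110.1 20.2
```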

  • If you want to review further, download the Z-tables from Blackboard and work through the practice worksheets to build familiarity with the types of questions likely to appear on the final exam.