Notes on Variable Types, Probability, and Medical Testing (Comprehensive)
Variable Types and Measurement Levels
- There are different levels for variables: nominal, ordinal, and numerical (interval/ratio).
- Key idea: nominal variables have categories with no intrinsic order; ordinal variables have a meaningful order; numerical variables are quantitative and can be averaged.
- Examples discussed:
- Exterior materials (wood siding, cement, brick, etc.) are nominal because there is no natural ordering among materials.
- Quality ratings such as “high” vs. “low” do suggest an ordinal ordering, but the exterior materials themselves remain nominal.
- Coding of categorical variables:
- A categorical variable can be encoded with numbers (e.g., gender coded as 1 for male and 0 for female). The numeric code is just a label, not a measurement.
- Case study example (survey of 253 students) with several variables to classify as nominal, ordinal, or numerical:
- Gender: coded as 0/1 (female/male). Since there is no natural order between male and female, this is nominal.
- Class year: coded as integers (e.g., 1 for freshman, 4 for senior; with 2 and 3 for sophomores/juniors). There is a clear order (freshman → sophomore → junior → senior), so this is ordinal.
- Early riser vs. owl (lark, neither, owl): nominal because there is no inherent order among lark, neither, and owl.
- Number of classes in the semester: numerical (a count).
- Anxiety score on a scale (e.g., 1 to 8 or 1 to 10): numerical, since averages can be taken (e.g., an average score like 6.5).
- Categorical vs numerical distinction when categorizing scores (normal/moderate/severe vs a numerical anxiety score): if the data are categories with a meaningful order (normal < moderate < severe), it’s ordinal; if the data are raw numeric scores, they’re numerical.
- Transforming numerical to categorical (and vice versa):
- You can convert a numerical variable to a categorical variable by binning into intervals. Example: converting year built into ranges (before 1900, 1900–1925, etc.) for a bar plot instead of a histogram.
- It is generally not appropriate to convert a purely categorical variable into a numerical value that implies a mathematical quantity. Numbers used to code categories are just labels, not measurements.
- Summary of conversion rules:
- Numerical → Categorical: possible via binning into intervals.
- Categorical → Numerical: not truly appropriate because the categories are labels, not ordered numeric quantities (unless you assign meaningful scores to categories and treat them as ordinal, but that involves assumptions).
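The numerical-to-categorical conversion above can be sketched in plain Python. This is an illustrative example (the cutoffs beyond 1925 are hypothetical, not from the notes):

```python
# Sketch (illustrative cutoffs): binning a numerical "year built" variable
# into ordered categories, as one would do before drawing a bar plot.
def bin_year_built(year):
    """Convert a numeric year into a coarse categorical label."""
    if year < 1900:
        return "before 1900"
    elif year <= 1925:
        return "1900-1925"
    elif year <= 1950:
        return "1926-1950"
    else:
        return "after 1950"

years = [1885, 1910, 1947, 1999]
categories = [bin_year_built(y) for y in years]
print(categories)  # ['before 1900', '1900-1925', '1926-1950', 'after 1950']
```

Note the information loss: once binned, 1910 and 1925 become indistinguishable, which is why the conversion only goes cleanly in one direction.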
Practice: Classifying Variables (Nominal vs Numerical)
- Example 1: Region (or country) of birth – nominal (categories with no natural order).
- Example 2: Employment status (e.g., employed, unemployed) – nominal.
- Example 3: Day of the week someone was born (Mon–Sun) – nominal: although days follow a weekly cycle, the labels carry no quantitative order for this purpose.
- Example 4: Distance from birthplace in kilometers – numerical.
- Note: Some questions in class involve deciding whether a variable is nominal, ordinal, or numerical; ordering or lack of natural order drives the classification.
Key Concepts: Variables vs Statistics vs Parameters
A variable (for each subject) is a value measured or observed for that subject.
A statistic is a summary value computed from a sample (e.g., the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$).
A parameter is the corresponding quantity for the entire population.
Common sources of confusion:
- The average amount of sleep in a class sample is a statistic, not a per-subject variable.
- The proportion of left-handed students in the class is a statistic, not a per-subject measurement.
- The sample size $n$ describes the study, not a variable observed on each subject (it is a fixed count for the dataset in this context).
Important takeaway: treat sample-level summaries as statistics; treat population-level quantities as parameters; the sample size is a descriptor of the study, not a per-subject measurement.
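The distinction can be made concrete with a tiny example (the data below are hypothetical, invented purely for illustration):

```python
# Sketch (hypothetical data): per-subject variables vs. sample-level statistics.
sleep_hours = [7.0, 6.5, 8.0, 5.5, 7.5]   # variable: one value per subject
left_handed = [0, 1, 0, 0, 0]             # variable: 1 = left-handed, per subject

n = len(sleep_hours)                       # sample size: describes the study, not a subject
mean_sleep = sum(sleep_hours) / n          # statistic: sample mean x-bar
prop_left = sum(left_handed) / n           # statistic: sample proportion

print(n, mean_sleep, prop_left)  # 5 6.9 0.2
```

The lists are variables (one entry per subject); `mean_sleep` and `prop_left` are statistics; the corresponding parameters would be the mean sleep and left-handed proportion in the whole population.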
Scenarios: Deterministic vs. Nondeterministic Processes
- Scenario 1 (savings accounts): Marcus and Caroline each put $50 into savings with a fixed interest rate $r=2.5\%$ compounded monthly for $t=6$ months.
- Will the final amounts be the same? If the inputs are identical and the process is deterministic, the ending balances should be the same.
- The formula for compound interest (monthly compounding) is:
A = P\left(1+\frac{r}{m}\right)^{mt}
where $P=50$, $r=0.025$, $m=12$, $t=0.5$ (six months).
- In this deterministic setup, the two accounts yield the same amount after six months.
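The deterministic claim can be checked numerically with the values from the scenario:

```python
# Sketch: evaluating the compound-interest formula A = P(1 + r/m)^(mt)
# for both savers, using the values given in the scenario.
P, r, m, t = 50.0, 0.025, 12, 0.5

A_marcus = P * (1 + r / m) ** (m * t)
A_caroline = P * (1 + r / m) ** (m * t)

print(round(A_marcus, 2))          # about 50.63 after six months
assert A_marcus == A_caroline      # identical inputs -> identical outputs
```

Because the process is deterministic, rerunning the computation always reproduces the same balance; contrast this with the focus-group scenario below.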
- Scenario 2 (product popularity focus groups): A marketing executive surveys 20 young adults for initial ratings; another group of 20 may yield different ratings.
- This is nondeterministic: same input (focus group) can yield different results due to sampling, opinions, etc.
- Deterministic vs nondeterministic: if the same input always produced the same output, the process would be deterministic; if outcomes vary, it is nondeterministic and we use probability to quantify likelihood of events.
- Probability basics:
- Probability is defined as the long-run frequency of an event when the process is repeated many times: P(A) = \lim_{n\to\infty} \frac{\text{# of times A occurs in n trials}}{n}.
- For equally likely outcomes (e.g., a fair coin):
P(\text{Heads}) = \frac{1}{2}, \quad P(\text{4 on a fair die}) = \frac{1}{6}.
- In real life, not all outcomes are equally likely; probabilities are often estimated from data (census, samples, etc.).
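The long-run-frequency definition can be illustrated with a quick simulation (a sketch, not part of the original notes):

```python
# Sketch: the long-run frequency interpretation of probability,
# approximated by simulating many fair-coin flips.
import random

random.seed(0)  # fixed seed for reproducibility
n_trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(n_trials))

print(heads / n_trials)  # close to the theoretical P(Heads) = 0.5
```

As `n_trials` grows, the observed frequency settles near 1/2, which is exactly what the limit definition above describes.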
- Base rate (prevalence):
- Base rate is the chance that a condition appears in the whole population (population prevalence).
- The base rate depends on place and time (e.g., polio prevalence changes over eras and locations).
- Base rate is essential when interpreting test results, especially in medical testing, because it influences post-test probabilities.
- Probability scale:
- Probabilities lie in the interval $[0,1]$ (0% to 100%).
- Negative probabilities or probabilities >1 are not meaningful and indicate an error in calculation or interpretation.
- 0% means the event is (almost surely) not going to happen; 100% means the event is certain; 50% means the event is as likely to occur as not.
- Practical takeaway: probability theory helps model uncertainty in real-world processes and distinguishes between fixed inputs and variable outputs.
Teacher Satisfaction: Contingency Tables and Exact vs Estimated Probabilities
- Contingency table setup (urban school district): Teaching level (Elementary, Middle, High) × Job satisfaction (Satisfied, Not Satisfied).
- Exact probabilities (from the sample of 674 respondents):
- Probability a randomly selected respondent teaches in high school:
P(\text{High}) = \frac{\text{Number in High}}{674} = \frac{205}{674} \approx 0.3042.
- Probability a randomly selected respondent is satisfied with their job:
P(\text{Satisfied}) = \frac{341}{674}.
- Probability a respondent teaches in middle school and is not satisfied: (cell count / 674) [specific cell count not fully disclosed in the transcript].
- Conditional probability (within a subgroup):
- Focus on elementary teachers only (row): 307 elementary teachers; among them, 125 are satisfied.
- Within elementary group:
P(\text{Satisfied} | \text{Elementary}) = \frac{125}{307} = 0.4072.
- Conditional probability example (not satisfied → probability of high school):
- Among not satisfied teachers, total = 333; among them, 74 teach in high school.
- Then:
P(\text{High} \mid \text{Not Satisfied}) = \frac{74}{333} = 0.2222.
- Distinguishing exact vs estimated probabilities:
- Exact probabilities are computed directly from the given sample counts (as above) and describe the sample itself.
- Estimated probabilities arise when those sample proportions are used to infer about the larger population of teachers in the district; their accuracy depends on how representative the sample is.
- The two conditional examples also illustrate how focusing on a subset changes the denominator and the context of the probability.
- Key takeaway: probabilities derived from a contingency table can be exact (from the sample) or conditional (within a subpopulation). Inference to the population relies on representativeness of the sample.
- Recap on probability phrasing:
- Given a random respondent, what is the probability they have a specific characteristic? Use the proportion across all respondents (denominator is the total sample size).
- If you know the respondent belongs to a subgroup, what is the probability of another characteristic within that subgroup? Use the subgroup totals (denominator is the subgroup size).
- If you ask for a joint event (e.g., High School and Not Satisfied), focus on the cell count for that combination divided by the total sample size.
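The three phrasings above can be computed from one table. In the sketch below, the elementary and high-school cells come from the notes; the middle-school cells are filled in from the stated totals (674 respondents, 341 satisfied, 307 elementary, 205 high school), so they follow by subtraction rather than being reported directly:

```python
# Sketch: marginal vs. conditional probabilities from the contingency table.
# Middle-school cells are derived from the stated row/column totals.
table = {
    "Elementary": {"Satisfied": 125, "Not Satisfied": 182},
    "Middle":     {"Satisfied": 85,  "Not Satisfied": 77},
    "High":       {"Satisfied": 131, "Not Satisfied": 74},
}
total = sum(sum(row.values()) for row in table.values())  # 674

# Marginal probability: denominator is the whole sample.
p_high = sum(table["High"].values()) / total

# Conditional probabilities: denominator is the subgroup size.
p_sat_given_elem = table["Elementary"]["Satisfied"] / sum(table["Elementary"].values())
not_sat_total = sum(row["Not Satisfied"] for row in table.values())
p_high_given_not_sat = table["High"]["Not Satisfied"] / not_sat_total

print(round(p_high, 4), round(p_sat_given_elem, 4), round(p_high_given_not_sat, 4))
# 0.3042 0.4072 0.2222
```

Note how each question changes only the denominator: the full sample for a marginal probability, the subgroup count for a conditional one.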
Medical Testing: Sensitivity, Specificity, and Base Rate
- Test outcomes and terminology:
- True positive: test result is positive, and the person has the condition.
- True negative: test result is negative, and the person does not have the condition.
- False positive: test result is positive, but the person does not have the condition.
- False negative: test result is negative, but the person does have the condition.
- Key rates:
- Sensitivity (true positive rate): \text{Sensitivity} = P(\text{Positive} \mid \text{Disease})
- Specificity (true negative rate): \text{Specificity} = P(\text{Negative} \mid \text{No Disease})
- False positive rate: \text{FPR} = 1 - \text{Specificity}
- False negative rate: \text{FNR} = 1 - \text{Sensitivity}
- Base rate (prevalence):
- The prevalence in the population is: P(\text{Disease}).
- The base rate is essential for interpreting post-test probabilities; the same test with given sensitivity/specificity can yield different post-test probabilities depending on prevalence.
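The four rates above can be computed directly from a 2×2 outcome table. The counts below are hypothetical, chosen only to make the arithmetic transparent:

```python
# Sketch (hypothetical counts): key test rates from a 2x2 outcome table.
tp, fn = 90, 10    # people with the disease: true positives, false negatives
tn, fp = 960, 40   # people without it: true negatives, false positives

sensitivity = tp / (tp + fn)   # P(Positive | Disease)    = 0.90
specificity = tn / (tn + fp)   # P(Negative | No Disease) = 0.96
fpr = 1 - specificity          # false positive rate
fnr = 1 - sensitivity          # false negative rate

print(sensitivity, specificity, round(fpr, 2), round(fnr, 2))
```

The denominators matter: sensitivity conditions on the diseased group, specificity on the disease-free group; neither involves the prevalence, which is exactly why the base rate must enter separately when interpreting a result.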
- HIV test example (illustrative numbers):
- Base rate (prevalence) in the United States: 0.34\% = 0.0034.
- Sensitivity: 0.75 (true positive rate).
- False positive rate: 0.04 (so specificity is 1 - 0.04 = 0.96).
- Therefore, the test characteristics are:
\text{Sensitivity} = 0.75, \text{Specificity} = 0.96, \text{FPR} = 0.04.
- Practical implications and interpretation:
- A false positive can cause unnecessary anxiety, isolation, and treatment; a false negative can miss a diagnosis and allow continued spread or progression of disease.
- Base rate affects the probability that a positive test truly indicates disease (post-test probability). A full interpretation requires Bayes’ rule:
P(\text{Disease} \mid \text{Positive}) = \frac{\text{Sensitivity} \cdot \text{Prevalence}}{\text{Sensitivity} \cdot \text{Prevalence} + \text{FPR} \cdot (1 - \text{Prevalence})}.
- Example values from the HIV scenario (for context):
- Prevalence (base rate): 0.0034
- Sensitivity: 0.75
- Specificity: 0.96
- False positive rate: 0.04
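Plugging these values into the Bayes formula above makes the base-rate effect concrete:

```python
# Sketch: post-test probability for the HIV scenario, via Bayes' rule:
# P(Disease | Positive) = Sens * Prev / (Sens * Prev + FPR * (1 - Prev))
prevalence = 0.0034
sensitivity = 0.75
fpr = 0.04  # = 1 - specificity (0.96)

numerator = sensitivity * prevalence
denominator = numerator + fpr * (1 - prevalence)
p_disease_given_positive = numerator / denominator

print(round(p_disease_given_positive, 3))  # about 0.06
```

Despite a positive result, the post-test probability is only about 6%: at such a low prevalence, the false positives from the large disease-free group outnumber the true positives.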
- Additional clinical interpretation notes:
- Always consider base rate when communicating test results to patients; a positive result does not guarantee disease without considering prevalence and test characteristics.
- The balance of sensitivity and specificity is crucial in screening vs. confirmatory testing strategies; higher sensitivity reduces false negatives, higher specificity reduces false positives.
- Ethical and practical implications:
- Misinterpretation of test results can lead to stigma, unnecessary treatment, or missed diagnoses.
- Public health decisions should incorporate base rates and test characteristics to avoid misinforming patients or policy.
Connections to Foundational Principles
- Measurement scales (nominal, ordinal, numerical) underpin data analysis choices, including which summary statistics are appropriate and which statistical tests to apply.
- The idea of converting between measurement levels (numerical → categorical) can simplify visualization (e.g., bar plots) but may incur information loss; converting categorical to numerical is generally inappropriate unless a meaningful ordinal interpretation is assigned.
- Probability theory connects data descriptions (contingency tables) to real-world uncertainty, guiding decisions under uncertainty and informing interpretation of tests and surveys.
- The interplay between sensitivity, specificity, and prevalence highlights the importance of context when interpreting diagnostic tests and the impact of base rates on predictive values.
Quick Reference: Key Formulas and Definitions
- Probability basics:
- If outcomes are equally likely: P(A) = \frac{\text{# of favorable outcomes}}{\text{# of possible outcomes}}.
- For a fair coin: P(\text{Heads}) = \frac{1}{2}.
- Compound interest (monthly):
A = P\left(1+\frac{r}{m}\right)^{mt}
where:
- $P$ = principal, $r$ = annual rate, $m$ = number of compounding periods per year, $t$ = time in years.
- Contingency table probabilities:
- Exact probability from table: cell count / total count.
- Conditional probability: P(A|B) = \frac{P(A \cap B)}{P(B)}.
- Medical test terminology:
- Sensitivity: \text{Sensitivity} = P(\text{Positive} \mid \text{Disease}).
- Specificity: \text{Specificity} = P(\text{Negative} \mid \text{No Disease}).
- False Positive Rate: \text{FPR} = 1 - \text{Specificity}.
- False Negative Rate: \text{FNR} = 1 - \text{Sensitivity}.
- Base rate (Prevalence): P(\text{Disease}).
- Bayes-style post-test probability (conceptual):
P(\text{Disease} \mid \text{Positive}) = \frac{\text{Sensitivity} \cdot \text{Prevalence}}{\text{Sensitivity} \cdot \text{Prevalence} + \text{FPR} \cdot (1 - \text{Prevalence})}.