Variables: Types, Frequency Tables & Bar Charts (Part 1)
Definition of a Variable
- A variable is any characteristic or attribute of a person or thing that can take on different values.
- May be expressed as a number (e.g., height) or assigned to a category (e.g., gender).
- Real-world relevance: variables are the basic building blocks of all statistical analyses; misidentifying them leads to flawed data collection and interpretation.
Variable Notation: Upper-case vs Lower-case
- Use upper-case letters (typically X,Y,Z) to denote the variable itself (the entire column of data).
- Example: X = Height of each person in a study.
- Use lower-case letters (typically x,y,z) for a single observed value of that variable.
- Example: x denotes the 5th person’s height; if that person is 60 inches tall, then x = 60.
- Significance: clear notation prevents mix-ups between the variable definition and individual data points when writing formulas or code.
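A minimal sketch of how this notation might carry over into code. The height values are made up for illustration; only the 60-inch height of the 5th person comes from the example above.

```python
# Hypothetical heights in inches; only the 5th value (60) comes from the notes.
X = [64, 71, 68, 66, 60, 73]   # X: the variable, i.e. the whole column of data

# x: a single observed value of X. Python lists are 0-indexed,
# so the 5th person sits at index 4.
x = X[4]

print(x)  # 60
```

Keeping the column and the single value under different names mirrors the upper-case/lower-case convention and makes it harder to confuse the two in a formula.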
Major Types of Variables
Categorical (Qualitative)
- Records membership in categories.
- Examples: blood type, hair color, country of origin, gender.
- Special sub-type: Ordinal variables—categories are ranked or ordered.
- Examples: freshman → sophomore → junior → senior; satisfaction levels (low, medium, high).
- Caution: although ordinal categories have an order, the “distance” between adjacent ranks is not necessarily uniform.
Numerical (Quantitative)
- Records counts or measurements; meaningful arithmetic can be performed.
- Examples: speed limit, age, number of daily meals.
Discrete vs Continuous Numerical Variables
- Discrete
- Finite or countably infinite set of possible values; often whole numbers.
- No intermediate values between two successive points (no 1.5 heads when counting coin tosses).
- Examples: number of pets, courses taken, books in a library, heads out of 10 flips.
- Continuous
- Can take any value within an interval; typically obtained via measurement.
- Examples: head circumference, weight, time between bus arrivals, length of James Harden’s beard.
- Practical implication: need to decide precision (e.g., seconds vs milliseconds) and consider measurement error.
Flowchart Summary (Narrative)
- Variable → Categorical vs Numerical.
- If Categorical → Ordinal? • Yes → e.g., class standing. • No → e.g., blood type.
- If Numerical → Continuous? • Yes → measurement of time, weight, length. • No (Discrete) → count of children, course load.
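The flowchart can be sketched as a small decision function. The function name, its parameters, and the returned labels are illustrative assumptions, not part of the notes; the examples are the ones listed above.

```python
def classify_variable(is_numeric, is_ordered=False, is_measured=False):
    """Walk the flowchart: categorical vs numerical, then the sub-type.

    is_numeric  -- True if the variable records counts or measurements
    is_ordered  -- for categorical variables: are the categories ranked?
    is_measured -- for numerical variables: measured (True) or counted (False)?
    """
    if not is_numeric:
        return "categorical (ordinal)" if is_ordered else "categorical (non-ordinal)"
    return "numerical (continuous)" if is_measured else "numerical (discrete)"


# Examples taken from the flowchart summary.
print(classify_variable(is_numeric=False, is_ordered=True))   # class standing
print(classify_variable(is_numeric=False))                    # blood type
print(classify_variable(is_numeric=True, is_measured=True))   # weight
print(classify_variable(is_numeric=True))                     # count of children
```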
Organizing Categorical Data
Frequency Distribution (FD)
- A table that lists each distinct category and the number of times (frequency) it appears.
- Purpose: a quick view of how the observations are distributed across categories; foundational for later charts.
Relative Frequency Distribution (RFD)
- Same structure as FD, but replace raw counts with relative frequencies:
$\text{Relative Frequency} = \dfrac{f}{\sum f}$
- Converts counts into proportions; facilitates comparison across samples of different sizes.
- Values sum to 1 (or 100%).
Constructing FD & RFD: Ship-Destination Example (15 Ships)
- Destinations: Bermuda, Southampton, Mediterranean, Caribbean.
- Tally procedure: scan the list and make one tally mark per occurrence; checking the tallies against the list confirms the manual count.
- Frequency counts obtained: 6 (Bermuda), 4 (Southampton), 2 (Mediterranean), 3 (Caribbean). (Always verify total 6+4+2+3=15.)
- Relative frequencies:
- Bermuda: $\frac{6}{15} = 0.40$
- Southampton: $\frac{4}{15} \approx 0.27$
- Mediterranean: $\frac{2}{15} \approx 0.13$
- Caribbean: $\frac{3}{15} = 0.20$
- Best practice: give the table a clear, informative title so colleagues (or your boss) can understand it at a glance.
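The ship example can be reproduced with a short sketch. The notes give only the four counts, so the list of 15 destinations below is rebuilt from those counts and its order is an assumption; the table title is also illustrative.

```python
from collections import Counter

# Rebuilt from the counts in the notes; the order of the 15 ships is assumed.
destinations = (["Bermuda"] * 6 + ["Southampton"] * 4
                + ["Mediterranean"] * 2 + ["Caribbean"] * 3)

freq = Counter(destinations)                           # frequency distribution
n = sum(freq.values())                                 # total observations
rel_freq = {dest: f / n for dest, f in freq.items()}   # relative frequencies

# Print a titled table with a totals row as a built-in sanity check.
print("Destinations of 15 Ships")
print(f"{'Destination':<15}{'Frequency':>10}{'Rel. Freq.':>12}")
for dest, f in freq.items():
    print(f"{dest:<15}{f:>10}{rel_freq[dest]:>12.2f}")
print(f"{'Total':<15}{n:>10}{sum(rel_freq.values()):>12.2f}")
```

The totals row plays the same role as the manual check 6+4+2+3=15: the frequencies should add up to the sample size and the relative frequencies to 1.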
Choosing Between FD and RFD
- Use Relative Frequency when comparing two or more data sets of unequal size.
- Frequency is fine for a single data set where absolute counts are meaningful.
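A small sketch of why relative frequencies help when sample sizes differ. Fleet A reuses the 15-ship counts from the example; Fleet B and all of its counts are hypothetical, invented purely for the comparison, and `relative_frequencies` is an illustrative helper name.

```python
def relative_frequencies(counts):
    """Convert a {category: count} table into {category: proportion}."""
    n = sum(counts.values())
    return {category: f / n for category, f in counts.items()}


# Fleet A: counts from the ship example (n = 15).
fleet_a = {"Bermuda": 6, "Southampton": 4, "Mediterranean": 2, "Caribbean": 3}
# Fleet B: hypothetical counts for a larger sample (n = 40).
fleet_b = {"Bermuda": 10, "Southampton": 16, "Mediterranean": 6, "Caribbean": 8}

rf_a = relative_frequencies(fleet_a)
rf_b = relative_frequencies(fleet_b)

print(f"{'Destination':<15}{'f (A)':>6}{'f (B)':>6}{'RF (A)':>8}{'RF (B)':>8}")
for category in fleet_a:
    print(f"{category:<15}{fleet_a[category]:>6}{fleet_b[category]:>6}"
          f"{rf_a[category]:>8.2f}{rf_b[category]:>8.2f}")
```

In this made-up comparison, Fleet B sends more ships to Bermuda in absolute terms (10 vs 6), yet Bermuda is a smaller share of Fleet B's sailings (0.25 vs 0.40). Raw counts alone would hide that.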
Visualizing Categorical Data: Bar Charts
- Each category is represented by a separate vertical bar.
- Height = frequency (or relative frequency).
- Bars do not touch because categories are discrete, not on a numeric continuum.
- Use bar charts for categorical variables only.
- Titles, labeled axes, and units (if any) are mandatory for clarity.
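As a rough illustration, the sketch below renders the ship-destination counts as a plain-text bar chart using only the standard library. It draws horizontal bars rather than the vertical bars described above, purely because that is easier in text; the function name and layout are assumptions, and a plotting library would normally be used instead.

```python
def text_bar_chart(counts, title, value_label):
    """Render a {category: count} table as a horizontal text bar chart."""
    width = max(len(category) for category in counts)
    # Title and value label come first so the chart is never unlabeled.
    lines = [title, f"(bar length = {value_label})"]
    for category, f in counts.items():
        # One '#' per observation; each category gets its own separate bar.
        lines.append(f"{category:<{width}} | {'#' * f} {f}")
    return "\n".join(lines)


ships = {"Bermuda": 6, "Southampton": 4, "Mediterranean": 2, "Caribbean": 3}
chart = text_bar_chart(ships, "Destinations of 15 Ships", "Frequency")
print(chart)
```

Each category gets its own row, the bar length encodes the frequency, and the title and value label are required arguments so they cannot be forgotten.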
MLB Nationality Example (198 Non-US Players)
- FD & RFD show that most players come from the Dominican Republic, followed by Venezuela, Canada, Cuba, and Mexico.
- Possible real-world reasons (hypotheses):
- Year-round warm climate (≈$80^{\circ}$ F) → more outdoor practice opportunities.
- Heavy scouting & investment by MLB organizations in Dominican & Venezuelan youth academies.
- Ethical implication: resource concentration in certain countries can shape athletic opportunities.
Quality Checks for Bar Charts
- Ensure all categories appear—omitting one (e.g., Cuba) invalidates conclusions.
- Match data type plotted to axis label—if axis says “Frequency” but heights reflect percentages, viewers are misled.
- Verify ordering: tallest bar should correspond to greatest frequency; quick scan for obvious mismatches.
- Maintain equal bar widths & consistent scale to avoid perceptual distortion.
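The first three checks above can be partly automated. The sketch below is one possible approach, not a standard tool: `check_bar_chart` is a hypothetical helper, and its test for "heights look like proportions" (every height at most 1) is a simplifying assumption. Because the notes do not list the MLB counts, the ship-destination counts are reused as test data.

```python
def check_bar_chart(counts, plotted, y_label):
    """Return a list of problems found when comparing a chart to its data.

    counts  -- {category: frequency} from the source data
    plotted -- {category: bar height} as drawn in the chart
    y_label -- text of the chart's y-axis label
    """
    problems = []

    # 1. Every category in the data must appear in the chart.
    missing = sorted(set(counts) - set(plotted))
    if missing:
        problems.append(f"missing categories: {', '.join(missing)}")

    # 2. The axis label must match what the heights represent.
    #    Heuristic (an assumption): heights that are all <= 1 are proportions.
    looks_relative = all(h <= 1 for h in plotted.values())
    label_says_relative = "relative" in y_label.lower()
    if looks_relative != label_says_relative:
        problems.append("y-axis label does not match the plotted values")

    # 3. The tallest bar must belong to the most frequent category.
    if plotted and max(plotted, key=plotted.get) != max(counts, key=counts.get):
        problems.append("tallest bar is not the most frequent category")

    return problems


ships = {"Bermuda": 6, "Southampton": 4, "Mediterranean": 2, "Caribbean": 3}

# A correct chart: all categories, raw counts, matching label.
print(check_bar_chart(ships, ships, "Frequency"))  # []

# A faulty chart: one category dropped, proportions plotted under "Frequency".
faulty = {"Bermuda": 0.40, "Southampton": 0.27, "Mediterranean": 0.13}
print(check_bar_chart(ships, faulty, "Frequency"))
```

Bar widths and scale consistency depend on how the chart is drawn, so they still need a visual check.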
Faulty Graph Examples Discussed
- Graph 1: Missing Cuba, y-axis labeled "Frequency" but uses relative frequencies.
- Graph 2: Axis label mismatch again; bars labeled incorrectly—Mexico shown as highest instead of Dominican Republic.
Practical, Ethical & Philosophical Reflections
- Correct variable identification underpins ethical data usage—misclassification can lead to erroneous policy or business decisions.
- Clear tabular & graphical presentations respect the audience’s limited time and reduce cognitive load.
- Visualization errors (missing categories or mislabeled axes) propagate misinformation quickly; diligence in checks is a professional responsibility.
- Investment disparities (e.g., baseball academies) highlight how data can reveal broader social and economic structures.
- Relative Frequency: $RF = \frac{f}{n}$, where $f$ is the class frequency and $n = \sum f$.
- Discrete set notation example: $\{0, 1, 2, \ldots, 10\}$ heads in 10 flips (no fractions allowed).
- Continuous interval example: beard length $\in [0, \infty)$ inches, theoretically infinite granularity.
Study Tips & Connections
- Always start an analysis by writing down each variable's type; it dictates which statistics and plots are permissible.
- Tie new concepts back to foundational ideas (e.g., measurement error, sampling design).
- When preparing reports, imagine handing them to a busy supervisor—would they understand context instantly? If not, add titles/labels.
- Practice by classifying everyday data you encounter (streaming-service categories, grocery receipts, fitness-tracker numbers) into discrete/continuous or categorical/ordinal.