Variables: Types, Frequency Tables & Bar Charts (Part 1)

Definition of a Variable

  • A variable is any characteristic or attribute of a person or thing that can take on different values.
    • May be expressed as a number (e.g., height) or assigned to a category (e.g., gender).
    • Real-world relevance: variables are the basic building blocks of all statistical analyses; misidentifying them leads to flawed data collection and interpretation.

Variable Notation: Upper-case vs Lower-case

  • Use upper-case letters (typically $X, Y, Z$) to denote the variable itself (the entire column of data).
    • Example: $X$ = height of each person in a study.
  • Use lower-case letters (typically $x, y, z$) for a single observed value of that variable.
    • Example: $x$ for the 5th person’s height; if that person is 60 inches tall, then $x = 60$.
  • Significance: clear notation prevents mix-ups between the variable definition and individual data points when writing formulas or code.
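To illustrate the same distinction in code, the sketch below stores a variable as a whole column and then pulls out one observed value. The names `X_heights` and `x_5` are made up for this example, and every height except the 5th person's 60 inches is invented.

```python
# Upper-case idea: the variable X is the whole column of observed heights (inches).
# All values except the 5th (60 inches, from the notes) are invented for illustration.
X_heights = [64, 71, 58, 67, 60, 69]

# Lower-case idea: x is one observed value, here the 5th person's height.
# Python lists are 0-indexed, so the 5th person sits at index 4.
x_5 = X_heights[4]

print(f"X has {len(X_heights)} observations; the 5th person's x = {x_5}")
```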

Major Types of Variables

Categorical (Qualitative)

  • Records membership in categories.
  • Examples: blood type, hair color, country of origin, gender.
  • Special sub-type: Ordinal variables—categories are ranked or ordered.
    • Examples: freshman → sophomore → junior → senior; satisfaction levels (low, medium, high).
    • Note of caution: although ordinal categories have an order, the “distance” between adjacent ranks is not necessarily equal, so differences or averages of ranks can mislead.
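One way to respect the order of ordinal categories without pretending the gaps are equal is to compare positions only. The sketch below is a minimal illustration; the helper `rank` and the survey responses are hypothetical.

```python
# Ordered categories from the notes, lowest to highest.
SATISFACTION_LEVELS = ["low", "medium", "high"]

def rank(level: str) -> int:
    """Return the position of a level in the ordering (0 = lowest)."""
    return SATISFACTION_LEVELS.index(level)

# Hypothetical survey responses, invented for illustration.
responses = ["high", "low", "medium", "low", "high"]

# Sorting and comparing by rank is meaningful for ordinal data.
ordered = sorted(responses, key=rank)
print(ordered)

# Comparisons use only the order, never the size of the gap between levels.
print(rank("medium") > rank("low"))
```

The code deliberately avoids subtracting or averaging ranks, since those operations would assume equal spacing between levels.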

Numerical (Quantitative)

  • Records counts or measurements; meaningful arithmetic can be performed.
    • Examples: speed limit, age, number of daily meals.

Discrete vs Continuous Numerical Variables

  • Discrete
    • Finite or countably infinite set of possible values; often whole numbers.
    • No intermediate values between two successive points (no 1.5 heads when counting coin tosses).
    • Examples: number of pets, courses taken, books in a library, heads out of 10 flips.
  • Continuous
    • Can take any value within an interval; typically obtained via measurement.
    • Examples: head circumference, weight, time between bus arrivals, length of James Harden’s beard.
    • Practical implication: need to decide precision (e.g., seconds vs milliseconds) and consider measurement error.
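The contrast can be made concrete with a short sketch: a discrete variable has a list of possible values, while a continuous measurement has to be recorded at some chosen precision. The raw timing value below is made up for illustration.

```python
# Discrete: heads in 10 coin flips can only be one of these 11 whole numbers.
possible_heads = list(range(0, 11))

# Continuous: a measured time can fall anywhere in an interval, so we must
# choose a recording precision. The raw reading below is a made-up value.
raw_wait_seconds = 754.38271

to_nearest_second = round(raw_wait_seconds)          # whole seconds
to_nearest_millisecond = round(raw_wait_seconds, 3)  # three decimal places

print(possible_heads)
print(to_nearest_second, to_nearest_millisecond)
```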

Flowchart Summary (Narrative)

  • Variable → Categorical vs Numerical.
    • If Categorical → Ordinal?
      • Yes → e.g., class standing.
      • No → e.g., blood type.
    • If Numerical → Continuous?
      • Yes → measurement of time, weight, length.
      • No (Discrete) → count of children, course load.
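The flowchart can be sketched as a small decision function. This is only an illustration: the function name `classify_variable` and its yes/no parameters are assumptions, and the analyst still has to answer each question for the data at hand.

```python
def classify_variable(is_numerical: bool, is_ordered: bool = False,
                      is_continuous: bool = False) -> str:
    """Walk the flowchart: categorical vs numerical, then the follow-up question."""
    if is_numerical:
        return "numerical, continuous" if is_continuous else "numerical, discrete"
    return "categorical, ordinal" if is_ordered else "categorical, not ordinal"

# Examples taken from the flowchart above.
print(classify_variable(is_numerical=False, is_ordered=True))     # class standing
print(classify_variable(is_numerical=False))                      # blood type
print(classify_variable(is_numerical=True, is_continuous=True))   # weight
print(classify_variable(is_numerical=True))                       # count of children
```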

Organizing Categorical Data

Frequency Distribution (FD)

  • A table that lists each distinct category and the number of times (frequency) it appears.
  • Purpose: quick visual of how data are spread; foundational for later charts.
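A frequency distribution is essentially a tally, which Python's standard `collections.Counter` can produce. The sketch below uses a hypothetical sample of blood types, invented only to show the mechanics.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (blood type), invented for illustration.
blood_types = ["O", "A", "A", "B", "O", "AB", "O", "A"]

# Counter tallies how many times each distinct category appears.
frequency = Counter(blood_types)

# Print the table, most frequent category first.
for category, count in frequency.most_common():
    print(f"{category:<3} {count}")
```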

Relative Frequency Distribution (RFD)

  • Same structure as FD, but replace raw counts with relative frequencies:
    $\text{Relative Frequency} = \frac{f}{\sum f}$
  • Converts counts into proportions; facilitates comparison across samples of different sizes.
  • Values sum to $1$ (or 100%).
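The conversion from counts to proportions might look like the following sketch. The function name `relative_frequencies` and the hair-color counts are hypothetical; the arithmetic is just each count divided by the total of all counts.

```python
def relative_frequencies(frequency: dict[str, int]) -> dict[str, float]:
    """Divide each count f by the total of all counts (sum of f)."""
    total = sum(frequency.values())
    if total == 0:
        raise ValueError("cannot compute relative frequencies of an empty table")
    return {category: count / total for category, count in frequency.items()}

# Hypothetical counts, invented for illustration.
hair_color_counts = {"brown": 10, "black": 6, "blond": 3, "red": 1}
rfd = relative_frequencies(hair_color_counts)

print(rfd)
print(sum(rfd.values()))  # should be 1, up to floating-point rounding
```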

Constructing FD & RFD: Ship-Destination Example (15 Ships)

  • Destinations: Bermuda, Southampton, Mediterranean, Caribbean.
  • Tally procedure: scan the list and mark each occurrence; adding up the tallies afterwards confirms that no ship was missed or counted twice.
  • Frequency counts obtained: 6 (Bermuda), 4 (Southampton), 2 (Mediterranean), 3 (Caribbean). (Always verify the total: $6 + 4 + 2 + 3 = 15$.)
  • Relative frequencies:
    • Bermuda: $\frac{6}{15} = 0.40$
    • Southampton: $\frac{4}{15} \approx 0.27$
    • Mediterranean: $\frac{2}{15} \approx 0.13$
    • Caribbean: $\frac{3}{15} = 0.20$
  • Best practice: give the table a clear, informative title so colleagues (or your boss) can understand it at a glance.
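The ship example can be reproduced as a titled table with a few lines of Python. The counts come from the example above; the title wording and the column layout are assumptions made for this sketch.

```python
# Counts taken from the ship-destination example above.
ship_counts = {"Bermuda": 6, "Southampton": 4, "Mediterranean": 2, "Caribbean": 3}

total = sum(ship_counts.values())
assert total == 15, "frequencies should add up to the number of ships"

# The title wording is an assumption; any clear, informative title works.
lines = ["Destinations of 15 Ships",
         f"{'Destination':<15}{'Freq':>5}{'Rel. freq':>11}"]
for destination, count in ship_counts.items():
    # Relative frequency = count / total, shown to two decimal places.
    lines.append(f"{destination:<15}{count:>5}{count / total:>11.2f}")
lines.append(f"{'Total':<15}{total:>5}{1:>11.2f}")

table = "\n".join(lines)
print(table)
```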

Choosing Between FD and RFD

  • Use Relative Frequency when comparing two or more data sets of unequal size.
  • Frequency is fine for a single data set where absolute counts are meaningful.
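A small numeric sketch shows why raw counts can mislead across samples of unequal size. Fleet A reuses the 6-of-15 Bermuda figure from the ship example; fleet B is entirely hypothetical.

```python
# Fleet A: 6 of 15 ships bound for Bermuda (from the ship example).
# Fleet B: invented numbers for a larger, hypothetical fleet.
fleet_a = {"bermuda": 6, "total": 15}
fleet_b = {"bermuda": 10, "total": 50}

share_a = fleet_a["bermuda"] / fleet_a["total"]
share_b = fleet_b["bermuda"] / fleet_b["total"]

# Raw counts suggest fleet B favors Bermuda more; proportions show the opposite.
print(f"Counts: A={fleet_a['bermuda']}, B={fleet_b['bermuda']}")
print(f"Shares: A={share_a:.2f}, B={share_b:.2f}")
```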

Visualizing Categorical Data: Bar Charts

  • Each category is represented by a separate vertical bar.
    • Height = frequency (or relative frequency).
    • Bars do not touch because categories are discrete, not on a numeric continuum.
  • Designed for categorical variables, not for numerical data measured on a continuous scale.
  • Titles, labeled axes, and units (if any) are mandatory for clarity.
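As a rough, dependency-free illustration, the sketch below draws a text bar chart for the ship-destination counts. The function `text_bar_chart` is hypothetical, and the bars run horizontally only because that is easier in plain text; a plotting library would normally draw them vertically.

```python
def text_bar_chart(frequency: dict[str, int], title: str, axis_label: str) -> str:
    """Render one bar of '#' per category, one '#' per unit of frequency."""
    width = max(len(category) for category in frequency)
    rows = [title, f"({axis_label})"]
    for category, count in frequency.items():
        # Each bar sits on its own line, so bars never touch.
        rows.append(f"{category:<{width}} | {'#' * count} {count}")
    return "\n".join(rows)

# Counts from the ship-destination example; the title wording is an assumption.
ship_counts = {"Bermuda": 6, "Southampton": 4, "Mediterranean": 2, "Caribbean": 3}
chart = text_bar_chart(ship_counts, "Destinations of 15 Ships", "bar length = frequency")
print(chart)
```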

MLB Nationality Example (198 Non-US Players)

  • The FD and RFD show that most players come from the Dominican Republic, followed by Venezuela, Canada, Cuba, and Mexico.
  • Possible real-world reasons (hypotheses):
    • Year-round warm climate ($\approx 80^{\circ}\mathrm{F}$) → more outdoor practice opportunities.
    • Heavy scouting & investment by MLB organizations in Dominican & Venezuelan youth academies.
    • Ethical implication: resource concentration in certain countries can shape athletic opportunities.

Quality Checks for Bar Charts

  1. Ensure all categories appear—omitting one (e.g., Cuba) invalidates conclusions.
  2. Match data type plotted to axis label—if axis says “Frequency” but heights reflect percentages, viewers are misled.
  3. Verify ordering: tallest bar should correspond to greatest frequency; quick scan for obvious mismatches.
  4. Maintain equal bar widths & consistent scale to avoid perceptual distortion.

Faulty Graph Examples Discussed

  • Graph 1: Missing Cuba, y-axis labeled "Frequency" but uses relative frequencies.
  • Graph 2: Axis label mismatch again; bars labeled incorrectly—Mexico shown as highest instead of Dominican Republic.
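The first three quality checks can be partly automated. The sketch below is an assumption-heavy illustration: `check_bar_chart` is a hypothetical helper, the "heights at most 1" test is a crude heuristic for spotting relative frequencies, and the data are the ship-destination counts from earlier in these notes because the MLB counts are not listed here. Bar widths and scale (check 4) are not covered.

```python
def check_bar_chart(data: dict[str, int], plotted: dict[str, float],
                    axis_label: str) -> list[str]:
    """Compare plotted bar heights with the source counts and list any problems."""
    problems = []
    # Check 1: every category in the data should appear in the chart.
    missing = sorted(set(data) - set(plotted))
    if missing:
        problems.append(f"missing categories: {', '.join(missing)}")
    # Check 2 (crude heuristic): heights all at most 1 look like proportions.
    looks_relative = bool(plotted) and all(h <= 1 for h in plotted.values())
    if axis_label.lower() == "frequency" and looks_relative:
        problems.append("axis says 'Frequency' but heights look like relative frequencies")
    # Check 3: the tallest bar should belong to the most frequent category.
    shared = [c for c in data if c in plotted]
    if shared:
        tallest = max(shared, key=lambda c: plotted[c])
        largest = max(shared, key=lambda c: data[c])
        if data[tallest] < data[largest]:
            problems.append(f"tallest bar is {tallest}, but {largest} has the greatest frequency")
    return problems

ship_counts = {"Bermuda": 6, "Southampton": 4, "Mediterranean": 2, "Caribbean": 3}

# A chart with a dropped category and proportions under a "Frequency" label.
faulty_1 = check_bar_chart(
    ship_counts, {"Bermuda": 0.40, "Southampton": 0.27, "Caribbean": 0.20}, "Frequency")
# A chart whose bars are attached to the wrong categories.
faulty_2 = check_bar_chart(
    ship_counts, {"Bermuda": 2, "Southampton": 4, "Mediterranean": 6, "Caribbean": 3}, "Frequency")
# A chart that matches the data.
clean = check_bar_chart(ship_counts, dict(ship_counts), "Frequency")

print(faulty_1, faulty_2, clean, sep="\n")
```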

Practical, Ethical & Philosophical Reflections

  • Correct variable identification underpins ethical data usage—misclassification can lead to erroneous policy or business decisions.
  • Clear tabular & graphical presentations respect the audience’s limited time and reduce cognitive load.
  • Visualization errors (missing categories or mislabeled axes) propagate misinformation quickly; diligence in checks is a professional responsibility.
  • Investment disparities (e.g., baseball academies) highlight how data can reveal broader social and economic structures.

Key Formulas & Notation Recap

  • Relative Frequency: $\text{RF} = \frac{f}{n}$, where $f$ is the class frequency and $n = \sum f$.
  • Discrete set notation example: $\{0, 1, 2, \dots, 10\}$ heads in 10 flips (no fractions allowed).
  • Continuous interval example: beard length $\in [0, \infty)$ inches, theoretically infinite granularity.

Study Tips & Connections

  • Always start any analysis by writing down variable type; it dictates permissible statistics and plots.
  • Tie new concepts back to foundational ideas (e.g., measurement error, sampling design).
  • When preparing reports, imagine handing them to a busy supervisor—would they understand context instantly? If not, add titles/labels.
  • Practice by classifying everyday data you encounter (streaming-service categories, grocery receipts, fitness-tracker numbers) into discrete/continuous or categorical/ordinal.