Chapter 1 Notes: Variables, Data Types, and Graphical Representations
Overview: population, sample, and variables
A sample is chosen to draw conclusions or answer questions about a population. In the transcript's example, the sample consists of 400 students.
Each student is a unit of observation; the individuals are numbered (student 1, student 2, …, student 400).
Variables are the characteristics measured or observed across the units (e.g., religion, country/state of origin, calculus course performance).
The first column labeled “student” is an identifier, not a variable. It is used to organize data, not to describe a population characteristic.
Data collection method mentioned: surveys. After collecting data, researchers organize it in a table for analysis.
Key concepts: variables and their values
Variables are the columns in a data table; each column represents a variable.
A value (or data point) is the entry in a cell under that variable for a given unit.
Value types are primarily numerical (quantitative) or categorical (qualitative).
Example of numerical values in the dataset: scores. ZIP codes are written with digits, but they are not suitable for arithmetic operations.
Example of non-numerical labels: ZIP codes, which function as geographical labels rather than numbers to be used in arithmetic.
The difference between a value and a variable: a variable is a characteristic; a value is a specific measurement of that characteristic for a unit.
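The distinction between identifier, variable, and value can be sketched in a few lines of Python. This is only an illustration: the column names (state, calculus_score) and every value in the rows are invented, not taken from the transcript's dataset.

```python
# Hypothetical rows from a survey table; every value here is invented for illustration.
rows = [
    {"student": 1, "state": "CA", "calculus_score": 88},
    {"student": 2, "state": "NY", "calculus_score": 73},
    {"student": 3, "state": "CA", "calculus_score": 91},
]

identifier = "student"  # organizes the rows; it is not a variable

# Variables are the remaining columns of the table.
variables = [column for column in rows[0] if column != identifier]

# The values of a variable are the cell entries in its column, one per unit.
values_of = {v: [row[v] for row in rows] for v in variables}

print(variables)
print(values_of["calculus_score"])
```

Each row is one unit of observation, each key other than the identifier is a variable, and each cell entry is a value.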
Numerical vs. categorical variables
Numerical variables contain numbers and are suitable for mathematical operations (mean, median, etc.).
Categorical variables contain labels or categories (names, labels, or codes). The labels may be written with digits, but they still function as categories.
Example of numerical values: scores, height, time, weight. ZIP codes do not belong in this list: they are treated as labels, not as quantities for arithmetic.
Example of categorical values: color, gender, ethnicity, academic program, and even player numbers used as labels rather than quantities.
Variable naming considerations: names are arbitrary but should be descriptive; singular names are common (e.g., Score, AcademicProgram).
If a variable name is a single letter (e.g., x), a note should accompany the table explaining what it stands for (e.g., x = score).
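A small Python sketch may help show why a digit-based label such as a ZIP code is categorical. The scores and ZIP codes below are invented for illustration, and storing ZIP codes as text is one reasonable convention, not something the transcript prescribes.

```python
from collections import Counter
from statistics import mean

# Hypothetical data, invented for illustration.
scores = [88, 73, 91]                      # numerical: arithmetic is meaningful
zip_codes = ["02139", "10001", "02139"]    # categorical labels, stored as text

average_score = mean(scores)               # a meaningful summary of a quantity
zip_counts = Counter(zip_codes)            # labels are summarized by counting

# Converting a ZIP code to a number changes the label itself:
as_number = int("02139")                   # the leading zero is lost

print(average_score, zip_counts.most_common(1), as_number)
```

Averaging the scores answers a real question about the sample; averaging ZIP codes would not, so the useful summary for them is a count per label.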
Types of numerical variables
Discrete numerical variables: counts that appear as separate, distinct values (e.g., number of children in a family).
Continuous numerical variables: measurements that can take any value within an interval (e.g., height, time, weight).
Examples from the transcript:
Number of children: discrete (0, 1, 2, 3, …), obtained by counting.
Height: continuous, obtained by measurement with a ruler or measuring tape.
Time to fall: continuous, measured with a stopwatch.
Important distinction: discrete values come from counting; continuous values come from measuring.
Types of categorical values
Categorical values can be nominal or ordinal.
Nominal: categories without a natural order (e.g., color, gender, ethnicity).
Ordinal: categories with a natural order (e.g., grades like A, B, C; or Likert scales).
The transcript discusses how grades are represented as categories (e.g., A, B, C) with an implied order (A is higher than B), where an A also corresponds to a range of scores (e.g., 90–100). This illustrates that ordinal categories can have associated numerical ranges.
A note on grading: A grade category may represent a range of values, not a single numerical value.
In some cases, numbers used as labels (e.g., player numbers, ZIP codes) function as categorical labels rather than quantitative values.
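The idea that an ordinal category stands for a range of scores can be sketched as follows. Only the A range (90–100) comes from the notes; the other cutoffs, the F category, and the function names letter_grade and rank are assumptions made for this illustration.

```python
# Only the A range (90-100) comes from the notes; the other cutoffs are assumed.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # lowest to highest

def letter_grade(score):
    """Map a score from 0 to 100 onto the grade whose range contains it."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter

def rank(letter):
    """Position of a grade in the natural order (higher is better)."""
    return GRADE_ORDER.index(letter)

scores = [95, 90, 84, 71, 58]  # invented scores
letters = [letter_grade(s) for s in scores]
print(letters)
print(rank("A") > rank("B"))
```

The grades can be compared by rank, which is what makes them ordinal, but the gap between two grades is not a fixed numerical amount.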
Organizing data: raw data vs organized data
Raw data: the unorganized measurements collected from observations (e.g., a list of grades or scores).
Organized data: data arranged for analysis, typically in a table with clearly defined variables.
When focusing on a categorical variable, we create a frequency table (a distribution of categories).
Frequency table and proportions
A frequency table lists each category (without repetition) and the count of observations in that category.
Example structure: categories (A, B, C, D) with counts (f_A, f_B, f_C, f_D).
Totals:
Sample size: N = sum of all frequencies (e.g., N = f_A + f_B + f_C + f_D).
Relative frequencies (proportions): p_i = \frac{f_i}{N}, with \sum_i p_i = 1.
Percentages can be derived from proportions: e.g., 20% for A if p_A = 0.20.
The transcript notes: sums of frequencies equal the sample size, and sums of relative frequencies equal 1.
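The frequency-table steps above can be sketched in Python. The list of grades is invented raw data, and the variable names (freq, N, props, percents) are illustrative choices that mirror the symbols f_i, N, and p_i in the notes.

```python
from collections import Counter

# Hypothetical raw data: letter grades for 10 students (invented for illustration).
grades = ["A", "B", "B", "C", "A", "D", "B", "C", "B", "A"]

freq = Counter(grades)                        # category -> frequency f_i
N = sum(freq.values())                        # sample size = sum of frequencies
props = {c: f / N for c, f in freq.items()}   # relative frequency p_i = f_i / N
percents = {c: 100 * p for c, p in props.items()}

# Print the table one category per row, in alphabetical order.
for c in sorted(freq):
    print(f"{c}: f = {freq[c]}, p = {props[c]:.2f}, {percents[c]:.0f}%")
```

Because the proportions are floating-point numbers, their sum may differ from 1 by a tiny rounding error, so a check against 1 should allow a small tolerance.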
Graphical representations of data
Frequency bar graph (for categorical data)
Vertical axis represents frequency (count) or proportion depending on the axis labeling.
Draw a bar for each category; bars are separated (no touching) to emphasize category distinctness.
Histogram (for continuous data)
Bars touch each other, reflecting the continuity of the variable (no gaps between adjacent class intervals).
A histogram is used when the variable is continuous, not categorical.
Key distinctions:
Bar graph: used for categorical variables; bars do not touch.
Histogram: used for numerical continuous variables; bars touch.
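As a rough illustration of a bar graph for categorical data, the sketch below draws one separate text bar per category. The function name text_bar_graph and the frequencies are invented, and the bars run horizontally only because that is easier in plain text; a real chart would normally come from a plotting library, with frequency on the vertical axis.

```python
def text_bar_graph(freq):
    """Render one text bar per category; bar length equals the frequency."""
    lines = []
    for category, count in freq.items():
        lines.append(f"{category} | {'#' * count} ({count})")
    return "\n".join(lines)

# Hypothetical frequencies, invented for illustration.
example = {"A": 3, "B": 4, "C": 2, "D": 1}
print(text_bar_graph(example))
```

Each category gets its own line, which plays the same role as the gaps between bars: the categories are distinct and there is no scale connecting them.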
Practical example: weight data (Born babies file)
Weight in pounds is a continuous numerical variable.
Categories are not given; we must construct them by choosing class intervals (bins).
Decision factor: number of bars (bins) to display. The transcript shows a scenario with eight bars.
Class width (class length): the width of each bin; expressed as w.
A formulaic approach is described for determining w:
Class length (width) w is tied to the range of data and the desired number of classes: w \approx \frac{\text{max} - \text{min}}{k}, where k is the number of classes.
In the example, with a minimum value of 5.6 and eight bars, the computed width is w \approx 0.5 (rounded to the nearest tenth).
Constructing the histogram bins:
Start at the lowest value: 5.6.
Class boundaries example: [5.6, 6.1), [6.1, 6.6), [6.6, 7.1), etc.
Each bin includes its lower boundary and excludes its upper boundary, except that the final bin may also include the maximum.
Frequencies per bin are obtained by counting how many data points fall into each interval.
Note: The exact frequencies per bin are not given in the transcript, but the method to determine them is outlined.
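The binning method can be sketched in Python as follows. The weights are invented, not the values from the babies file, and the maximum of 9.6 is an assumption chosen so that the width works out to 0.5 as in the notes. The helper names bin_edges and bin_counts are hypothetical. Decimal is used instead of float so that values lying exactly on a boundary are classified reliably.

```python
from decimal import Decimal

def bin_edges(minimum, width, k):
    """Return the k + 1 class boundaries, starting at the minimum."""
    return [minimum + i * width for i in range(k + 1)]

def bin_counts(data, edges):
    """Count values in [lower, upper); the final bin also includes its upper bound."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        for i in range(len(counts)):
            lower, upper = edges[i], edges[i + 1]
            if lower <= x < upper or (i == last and x == upper):
                counts[i] += 1
                break
    return counts

# Hypothetical weights in pounds (invented; the maximum 9.6 is an assumption).
raw = ["5.6", "5.9", "6.1", "6.4", "6.6", "7.0", "7.1", "8.3", "9.6"]
weights = [Decimal(s) for s in raw]

k = 8                                        # desired number of bars
width = (max(weights) - min(weights)) / k    # (9.6 - 5.6) / 8
edges = bin_edges(min(weights), width, k)
counts = bin_counts(weights, edges)

for i, count in enumerate(counts):
    print(f"[{edges[i]}, {edges[i + 1]}): {count}")
```

The printed labels use the half-open notation for every bin, but the last bin is treated as closed on the right so that the maximum is counted. The counts should add up to the number of data points, which is a quick check that no value was dropped.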
Additional notes on data semantics and best practices
The “name of a variable” is arbitrary but should be descriptive; example: AcademicProgram as a variable name with descriptive labeling rather than an abstract code.
When using symbolic names (e.g., X) for a variable, provide a legend or note clarifying what X represents (e.g., X means Score).
The transcript emphasizes that some numeric-looking identifiers (like ZIP codes or player numbers) behave as categorical labels rather than numerical values to be manipulated arithmetically.
Understanding the natural order: ordinal data have a natural ranking; nominal data do not.
Quick references and formulas (summary)
Mean (average) for numerical data: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Proportions from a frequency table: p_i = \frac{f_i}{N}, \quad \sum_i p_i = 1
Sample size from frequencies: N = \sum_i f_i
Histogram bin width approximation: w \approx \frac{\text{max} - \text{min}}{k}
Class boundaries concept: [min, min+w), [min+w, min+2w), …
Connections to broader statistics concepts
Distinguishing variable types is foundational for choosing appropriate descriptive statistics and visualization (means/medians for numerical data; frequencies and mode for categorical data).
Understanding raw data vs organized data prepares you for data cleaning, transformation, and analysis workflows.
The choice of bin width in a histogram affects interpretability and should reflect the data distribution; the rule of thumb in the example (range divided by the number of classes) is a practical, data-driven starting point.
Quick practice prompts (to test understanding)
Identify which of the following are variables and classify them as numerical or categorical: height, ZIP code, gender, academic program, student ID.
For a categorical dataset with categories A, B, C, D and frequencies 4, 8, 5, 3, compute N, the proportions, and the percentages.
Explain why ZIP code is treated as a non-numerical value for mathematical operations even though it consists of digits.
Given a continuous variable with min = 5.6, max = 9.6, and you want 8 bins, compute the bin width and write the first three bin boundaries.
Distinguish between when you would use a bar graph vs a histogram for a given dataset.