Chapter 1 Notes: Variables, Data Types, and Graphical Representations

Overview: population, sample, and variables
- A sample is chosen to draw conclusions or answer questions about a population. Example in the transcript: 400 students as the sample.
- Each student is a unit of observation; the individuals are numbered (student 1, student 2, …, student 400).
- Variables are the characteristics measured or observed across the units (e.g., religion, country/state of origin, calculus course performance).
- The first column labeled “student” is an identifier, not a variable. It is used to organize data, not to describe a population characteristic.
- Data collection method mentioned: surveys. After collecting data, researchers organize it in a table for analysis.
Key concepts: variables and their values
- Variables are the columns in a data table; each column represents a variable.
- A value (or data point) is the entry in a cell under that variable for a given unit.
- Value types are primarily numerical (quantitative) or categorical (qualitative).
- Example of numerical values in the dataset: scores, which can be added and averaged.
- Example of non-numerical labels: ZIP codes. They are written with digits but function as geographical labels, so arithmetic on them is not meaningful.
- The difference between a value and a variable: a variable is a characteristic; a value is a specific measurement of that characteristic for a unit.
Numerical vs. categorical variables
- Numerical variables contain numbers and are suitable for mathematical operations (mean, median, etc.).
- Categorical variables contain labels or categories (names, labels, or codes). The labels may be written with digits but still function as categories.
- Example of numerical values: scores, height, time, weight. ZIP codes do not belong here: they are labels rather than quantities for arithmetic.
- Example of categorical values: color, gender, ethnicity, academic program, and even player numbers used as labels rather than quantities.
- Variable naming considerations: names are arbitrary but should be descriptive; whether to use singular or plural names is a matter of convention (e.g., Score, AcademicProgram).
- If a variable name is a single letter (e.g., x), a note should accompany the table explaining what it stands for (e.g., x = score).
Types of numerical variables
- Discrete numerical variables: counts that appear as separate, distinct values (e.g., number of children in a family).
- Continuous numerical variables: measurements that can take any value within an interval (e.g., height, time, weight).
- Examples from the transcript:
  - Number of children: discrete (0, 1, 2, 3, …), obtained by counting.
  - Height: continuous, obtained by measurement with a ruler or measuring tape.
  - Time to fall: continuous, measured with a stopwatch.
- Important distinction: discrete values come from counting; continuous values come from measuring.
Types of categorical values
- Categorical values can be nominal or ordinal.
- Nominal: categories without a natural order (e.g., color, gender, ethnicity).
- Ordinal: categories with a natural order (e.g., grades like A, B, C; or Likert scales).
- The transcript discusses how grades are represented as categories (e.g., A, B, C) with an implied order (A is higher than B), where an A also corresponds to a range of scores (e.g., 90–100). This illustrates that ordinal categories can have associated numerical ranges.
- A note on grading: A grade category may represent a range of values, not a single numerical value.
- In some cases, numbers used as labels (e.g., player numbers, ZIP codes) function as categorical labels rather than quantitative values.
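The nominal/ordinal distinction can be sketched in a few lines of Python. This is only an illustration: the grade scale `GRADE_ORDER`, the helper `grade_rank`, and all sample values are hypothetical and do not come from the transcript. ZIP codes are stored as strings so that they cannot be averaged by accident.

```python
from collections import Counter

# Hypothetical ordinal scale, listed from lowest to highest.
GRADE_ORDER = ["D", "C", "B", "A"]

def grade_rank(grade):
    """Return the position of a grade on the ordinal scale (higher is better)."""
    return GRADE_ORDER.index(grade)

# Hypothetical observations (not from the transcript).
grades = ["B", "A", "C", "B", "D", "A", "B"]
zip_codes = ["90210", "10001", "90210", "60614"]  # strings: labels, not quantities

# Ordinal data have a natural order, so sorting by rank is meaningful.
sorted_grades = sorted(grades, key=grade_rank)

# Nominal labels have no order; counting them is the meaningful operation.
zip_counts = Counter(zip_codes)
most_common_zip = zip_counts.most_common(1)[0][0]

print(sorted_grades)
print(most_common_zip)
```

Sorting works for the grades because the scale defines a ranking; for the ZIP codes only counts are computed, since neither an order nor an average would describe anything real.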
Organizing data: raw data vs organized data
- Raw data: the unorganized measurements collected from observations (e.g., a list of grades or scores).
- Organized data: data arranged for analysis, typically in a table with clearly defined variables.
- When focusing on a categorical variable, we create a frequency table (a distribution of categories).
Frequency table and proportions
- A frequency table lists each category (without repetition) and the count of observations in that category.
- Example structure: categories (A, B, C, D) with counts ($f_A$, $f_B$, $f_C$, $f_D$).
- Totals:
  - Sample size: N = sum of all frequencies (e.g., $N = f_A + f_B + f_C + f_D$).
- Relative frequencies (proportions): $p_i = \frac{f_i}{N}$ with $\sum_i p_i = 1$.
- Percentages can be derived from proportions: e.g., 20% for A if $p_A = 0.20$.
- The transcript notes: sums of frequencies equal the sample size, and sums of relative frequencies equal 1.
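A minimal sketch of building such a frequency table in Python follows. The function name `frequency_table` and the list `raw_grades` are hypothetical, chosen only for illustration; the data are made up and are not from the transcript.

```python
from collections import Counter

def frequency_table(values):
    """Return (rows, n): one row per category, plus the sample size n."""
    counts = Counter(values)            # frequency f_i of each category
    n = sum(counts.values())            # N = sum of all frequencies
    rows = {}
    for category in sorted(counts):     # each category is listed once
        f = counts[category]
        p = f / n                       # relative frequency p_i = f_i / N
        rows[category] = {"frequency": f, "proportion": p, "percent": 100 * p}
    return rows, n

# Hypothetical raw data (not from the transcript).
raw_grades = ["A", "B", "B", "C", "A", "D", "B", "C", "B", "A"]
rows, n = frequency_table(raw_grades)

for category, row in rows.items():
    print(category, row["frequency"], round(row["proportion"], 2), f'{row["percent"]:.0f}%')
print("N =", n)
```

The two checks stated in the notes can be applied to the result: the frequencies should add up to N, and the proportions should add up to 1 (up to floating-point rounding).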
Graphical representations of data
- Frequency bar graph (for categorical data)
  - The vertical axis represents frequency (count) or proportion, as indicated by the axis label.
  - Draw a bar for each category; bars are separated (no touching) to emphasize that the categories are distinct.
- Histogram (for continuous data)
  - Bars touch each other, reflecting the continuity of the variable (no gaps between adjacent classes).
  - A histogram is used when the variable is continuous, not categorical.
- Key distinctions:
  - Bar graph: used for categorical variables; bars do not touch.
  - Histogram: used for numerical continuous variables; bars touch.
Practical example: weight data (Born babies file)
- Weight in pounds is a continuous numerical variable.
- Categories are not given; we must construct them by choosing class intervals (bins).
- Decision factor: the number of bars (bins) to display. The transcript shows a scenario with eight bars.
- Class width (class length): the width of each bin, denoted w.
- A formula-based approach is described for determining w:
  - The class width w is tied to the range of the data and the desired number of classes: $w \approx \frac{\text{max} - \text{min}}{k}$, where k is the number of classes.
  - In the example, with a minimum value of 5.6 and eight bars, the computed width is $w \approx 0.5$ (rounded to the nearest tenth).
- Constructing the histogram bins:
  - Start at the lowest value: 5.6.
  - Class boundaries example: [5.6, 6.1), [6.1, 6.6), [6.6, 7.1), etc.
  - Boundaries are inclusive at the lower bound and exclusive at the upper bound, except that the final bin may include the maximum.
- Frequencies per bin are obtained by counting how many data points fall into each interval.
- Note: The exact frequencies per bin are not given in the transcript, but the method to determine them is outlined.
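The binning procedure can be sketched in Python as follows. The helper names (`class_width`, `class_edges`, `class_frequencies`) are hypothetical, and so are the weights and the maximum of 9.6: the transcript gives only the minimum (5.6), the number of bars (eight), and the resulting width. Boundaries are rounded to one decimal so that values such as 6.1 land exactly on a class boundary despite floating-point arithmetic.

```python
def class_width(minimum, maximum, k, digits=1):
    """Approximate class width w = (max - min) / k, rounded to `digits` decimals."""
    return round((maximum - minimum) / k, digits)

def class_edges(minimum, width, k, digits=1):
    """Return the k + 1 class boundaries min, min + w, ..., min + k*w."""
    return [round(minimum + i * width, digits) for i in range(k + 1)]

def class_frequencies(data, edges):
    """Count values per class [lower, upper); the last class also includes its upper bound."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        for i in range(len(counts)):
            lower, upper = edges[i], edges[i + 1]
            if lower <= x < upper or (i == last and x == upper):
                counts[i] += 1
                break
    return counts

# Hypothetical weights in pounds (assumed data, not from the transcript).
weights = [5.6, 5.9, 6.1, 6.4, 6.6, 7.0, 7.2, 7.5, 7.8, 8.1, 8.3, 9.0, 9.6]

width = class_width(min(weights), max(weights), k=8)
edges = class_edges(min(weights), width, k=8)
counts = class_frequencies(weights, edges)

for i, f in enumerate(counts):
    print(f"[{edges[i]}, {edges[i + 1]}): {f}")
```

As with a categorical frequency table, the class frequencies should add up to the number of observations, which is a quick check that no value fell outside the bins.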
Additional notes on data semantics and best practices
- The “name of a variable” is arbitrary but should be descriptive; example: AcademicProgram as a variable name with descriptive labeling rather than an abstract code.
- When using symbolic names (e.g., X) for a variable, provide a legend or note clarifying what X represents (e.g., X means Score).
- The transcript emphasizes that some numeric-looking identifiers (like ZIP codes or player numbers) behave as categorical labels rather than numerical values to be manipulated arithmetically.
- Understanding the natural order: ordinal data have a natural ranking; nominal data do not.
Quick references and formulas (summary)
- Mean (average) for numerical data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Proportions from a frequency table: $p_i = \frac{f_i}{N}, \quad \sum_i p_i = 1$
- Sample size from frequencies: $N = \sum_i f_i$
- Histogram bin width approximation: $w \approx \frac{\text{max} - \text{min}}{k}$
- Class boundaries concept: [min, min+w), [min+w, min+2w), …
Connections to broader statistics concepts
- Distinguishing variable types is foundational for choosing appropriate descriptive statistics and visualization (means/medians for numerical data; frequencies and mode for categorical data).
- Understanding raw data vs organized data prepares you for data cleaning, transformation, and analysis workflows.
- The choice of bin width affects how a histogram is read and should reflect the data distribution; the rule of thumb used in the example (range divided by the number of classes) is one practical, data-driven way to choose it.
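The first connection above, that the variable type determines which summaries make sense, can be illustrated with Python's standard `statistics` module. The function `summarize`, its `kind` argument, and the sample values are hypothetical and serve only as a sketch.

```python
import statistics

def summarize(values, kind):
    """Return summaries that suit the variable type ('numerical' or 'categorical')."""
    if kind == "numerical":
        # Arithmetic is meaningful, so mean and median apply.
        return {"mean": statistics.mean(values), "median": statistics.median(values)}
    if kind == "categorical":
        # Only counting is meaningful, so report the most frequent category.
        return {"mode": statistics.mode(values)}
    raise ValueError(f"unknown variable type: {kind}")

# Hypothetical data (not from the transcript).
scores = [70, 85, 90, 85, 100]
programs = ["Math", "Biology", "Math", "Physics"]

score_summary = summarize(scores, "numerical")
program_summary = summarize(programs, "categorical")

print(score_summary)
print(program_summary)
```

Passing the type explicitly is a deliberate choice in this sketch: a column of digits such as ZIP codes cannot be classified reliably from its contents alone, so the analyst has to state how each variable should be treated.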
Quick practice prompts (to test understanding)
- Identify which of the following are variables and classify them as numerical or categorical: height, ZIP code, gender, academic program, student ID.
- For a categorical dataset with categories A, B, C, D and frequencies 4, 8, 5, 3, compute N, the proportions, and the percentages.
- Explain why ZIP code is treated as a non-numerical value for mathematical operations even though it consists of digits.
- Given a continuous variable with min = 5.6 and max = 9.6, and a target of 8 bins, compute the bin width and write the first three bin boundaries.
- Explain when you would use a bar graph versus a histogram for a given dataset.