Chapter 1 Notes: Variables, Data Types, and Graphical Representations

  • Overview: population, sample, and variables

    • A sample is chosen to draw conclusions or answer questions about a population. Example in the transcript: 400 students as the sample.

    • Each student is a unit of observation; the individuals are numbered (student 1, student 2, …, student 400).

    • Variables are the characteristics measured or observed across the units (e.g., religion, country/state of origin, calculus course performance).

    • The first column labeled “student” is an identifier, not a variable. It is used to organize data, not to describe a population characteristic.

    • Data collection method mentioned: surveys. After collecting data, researchers organize it in a table for analysis.

  • Key concepts: variables and their values

    • Variables are the columns in a data table; each column represents a variable.

    • A value (or data point) is the entry in a cell under that variable for a given unit.

    • Value types are primarily numerical (quantitative) or categorical (qualitative).

    • Example of numerical values in the dataset: scores.

    • Example of non-numerical labels: ZIP codes. They are written with digits but function as geographical labels, so arithmetic on them is meaningless.

    • The difference between a value and a variable: a variable is a characteristic; a value is a specific measurement of that characteristic for a unit.

  • Numerical vs. categorical variables

    • Numerical variables contain numbers and are suitable for mathematical operations (mean, median, etc.).

    • Categorical variables contain labels or categories (names, labels, or codes). The labels may be written as numbers, but they still function as categories.

    • Example of numerical values: scores, height, time, weight. ZIP codes do not belong here: they are labels, not quantities for arithmetic.

    • Example of categorical values: color, gender, ethnicity, academic program, and even player numbers used as labels rather than quantities.

    • Variable naming considerations: names are arbitrary but should be descriptive; whether to use the singular or plural form is a matter of convention (e.g., Score, AcademicProgram).

    • If a variable name is a single letter (e.g., x), a note should accompany the table explaining what it stands for (e.g., x = score).

  • Types of numerical variables

    • Discrete numerical variables: counts that appear as separate, distinct values (e.g., number of children in a family).

    • Continuous numerical variables: measurements that can take any value within an interval (e.g., height, time, weight).

    • Examples from the transcript:

    • Number of children: discrete (0, 1, 2, 3, …), obtained by counting.

    • Height: continuous, obtained by measurement with a ruler or measuring tape.

    • Time to fall: continuous, measured with a stopwatch.

    • Important distinction: discrete values come from counting; continuous values come from measuring.

  • Types of categorical variables

    • Categorical variables can be nominal or ordinal.

    • Nominal: categories without a natural order (e.g., color, gender, ethnicity).

    • Ordinal: categories with a natural order (e.g., grades like A, B, C; or Likert scales).

    • The transcript discusses how letter grades (e.g., A, B, C) are categories with an implied order (A is higher than B), and how each grade can correspond to a range of scores (e.g., A for 90–100). This illustrates that ordinal categories can have associated numerical ranges.

    • A note on grading: A grade category may represent a range of values, not a single numerical value.

    • In some cases, numbers used as labels (e.g., player numbers, ZIP codes) function as categorical labels rather than quantitative values.
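
The distinction between quantities and numeric-looking labels can be sketched in a few lines of Python. This is only an illustration: the scores and ZIP codes below are made-up values, and the variable names are hypothetical, not taken from the transcript.

```python
from statistics import mean, mode

# Hypothetical values for illustration; none of them come from the transcript.
scores = [88, 92, 75, 92]                          # numerical: arithmetic is meaningful
zip_codes = ["02139", "90210", "02139", "10001"]   # categorical: digits used as labels

average_score = mean(scores)        # a meaningful summary of a quantity
most_common_zip = mode(zip_codes)   # a meaningful summary of a category

print(average_score)
print(most_common_zip)

# Converting a label to a number loses information: the leading zero disappears.
print(int("02139"))
```

Storing ZIP codes as text keeps the leading zero and makes it harder to average them by accident.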

  • Organizing data: raw data vs organized data

    • Raw data: the unorganized measurements collected from observations (e.g., a list of grades or scores).

    • Organized data: data arranged for analysis, typically in a table with clearly defined variables.

    • When focusing on a categorical variable, we create a frequency table (a distribution of categories).

  • Frequency table and proportions

    • A frequency table lists each category (without repetition) and the count of observations in that category.

    • Example structure: categories (A, B, C, D) with counts (f_A, f_B, f_C, f_D).

    • Totals:

    • Sample size: N = sum of all frequencies (e.g., N = f_A + f_B + f_C + f_D).

    • Relative frequencies (proportions): p_i = \frac{f_i}{N} with \sum_i p_i = 1.

    • Percentages can be derived from proportions: e.g., 20% for A if p_A = 0.20.

    • The transcript notes: sums of frequencies equal the sample size, and sums of relative frequencies equal 1.
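
As a rough sketch of the steps above, the following Python turns a raw list of categories into a frequency table with proportions and percentages. The grade list is made up for illustration, and `frequency_table` is a hypothetical helper name, not something from the transcript.

```python
from collections import Counter

# Hypothetical raw data: letter grades for 10 students (not from the transcript).
grades = ["B", "A", "C", "B", "B", "D", "A", "C", "B", "A"]

def frequency_table(values):
    """Return (N, rows), where each row is (category, frequency, proportion)."""
    counts = Counter(values)          # each category appears once, with its count
    n = sum(counts.values())          # N = sum of all frequencies
    rows = [(cat, f, f / n) for cat, f in sorted(counts.items())]
    return n, rows

n, rows = frequency_table(grades)
print(f"N = {n}")
for cat, f, p in rows:
    # p is the relative frequency; formatting with % multiplies it by 100
    print(f"{cat}: f = {f}, p = {p:.2f}, percentage = {p:.0%}")
```

Summing the frequency column should give back N, and summing the proportion column should give 1 (up to floating-point rounding), which is a quick check that no observation was dropped.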

  • Graphical representations of data

    • Frequency bar graph (for categorical data)

    • Vertical axis represents frequency (count) or proportion depending on the axis labeling.

    • Draw a bar for each category; bars are separated (no touching) to emphasize category distinctness.

    • Histogram (for continuous data)

    • Bars touch each other, reflecting the continuity of the variable (no gaps between adjacent classes).

    • A histogram is used when the variable is continuous, not categorical.

    • Key distinctions:

    • Bar graph: used for categorical variables; bars do not touch.

    • Histogram: used for numerical continuous variables; bars touch.

  • Practical example: weight data (Born babies file)

    • Weight in pounds is a continuous numerical variable.

    • Categories are not given; we must construct them by choosing class intervals (bins).

    • Decision factor: number of bars (bins) to display. The transcript shows a scenario with eight bars.

    • Class width (class length): the width of each bin; expressed as w.

    • A formulaic approach is described for determining w:

    • Class length (width) w is tied to the range of data and the desired number of classes: w \approx \frac{\text{max} - \text{min}}{k}, where k is the number of classes.

    • In the example, with a minimum value of 5.6 and eight bars, the computed width is about w \approx 0.5 (rounded to the nearest tenth).

    • Constructing the histogram bins:

    • Start at the lowest value: 5.6.

    • Class boundaries example: [5.6, 6.1), [6.1, 6.6), [6.6, 7.1), etc.

    • Each interval includes its lower bound and excludes its upper bound, except that the final bin may include the maximum.

    • Frequencies per bin are obtained by counting how many data points fall into each interval.

    • Note: The exact frequencies per bin are not given in the transcript, but the method to determine them is outlined.
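
The binning procedure can be sketched in Python as follows. Several things here are assumptions made for illustration: the maximum of 9.6, the list of weights, and the helper names `make_bins` and `bin_frequencies` do not come from the transcript. Note also that rounding the width down can leave the maximum outside the last bin; this sketch does not handle that case.

```python
def make_bins(minimum, maximum, k, decimals=1):
    """Return (width, bins): k equal-width class intervals starting at the minimum."""
    width = round((maximum - minimum) / k, decimals)   # w ≈ (max - min) / k
    edges = [round(minimum + i * width, decimals) for i in range(k + 1)]
    return width, list(zip(edges[:-1], edges[1:]))

def bin_frequencies(data, bins):
    """Count values per bin: lower bound inclusive, upper exclusive, last bin closed."""
    counts = [0] * len(bins)
    for x in data:
        for i, (lo, hi) in enumerate(bins):
            is_last = i == len(bins) - 1
            if lo <= x < hi or (is_last and x == hi):
                counts[i] += 1
                break
    return counts

# Hypothetical weights in pounds; only the minimum 5.6 and k = 8 match the example.
weights = [5.6, 6.0, 6.1, 6.4, 6.6, 7.0, 7.2, 7.3, 7.8, 8.1, 8.5, 8.9, 9.2, 9.6]

width, bins = make_bins(min(weights), max(weights), k=8)
counts = bin_frequencies(weights, bins)
for (lo, hi), f in zip(bins, counts):
    print(f"[{lo}, {hi}): {f}")
```

A value that sits exactly on a boundary, such as 6.1, is counted in the bin that starts there, which matches the inclusive-lower, exclusive-upper convention.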

  • Additional notes on data semantics and best practices

    • The name of a variable is arbitrary but should be descriptive; for example, AcademicProgram is clearer than an abstract code.

    • When using symbolic names (e.g., X) for a variable, provide a legend or note clarifying what X represents (e.g., X means Score).

    • The transcript emphasizes that some numeric-looking identifiers (like ZIP codes or player numbers) behave as categorical labels rather than numerical values to be manipulated arithmetically.

    • Understanding the natural order: ordinal data have a natural ranking; nominal data do not.

  • Quick references and formulas (summary)

    • Mean (average) for numerical data: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

    • Proportions from a frequency table: p_i = \frac{f_i}{N}, \quad \sum_i p_i = 1

    • Sample size from frequencies: N = \sum_i f_i

    • Histogram bin width approximation: w \approx \frac{\text{max} - \text{min}}{k}

    • Class boundaries concept: [min, min+w), [min+w, min+2w), …

  • Connections to broader statistics concepts

    • Distinguishing variable types is foundational for choosing appropriate descriptive statistics and visualization (means/medians for numerical data; frequencies and mode for categorical data).

    • Understanding raw data vs organized data prepares you for data cleaning, transformation, and analysis workflows.

    • The choice of bin width in a histogram affects interpretability and should reflect the data's distribution; the rule of thumb w \approx \frac{\text{max} - \text{min}}{k} used in the example is one practical, data-driven way to choose it.

  • Quick practice prompts (to test understanding)

    • Identify which of the following are variables and classify them as numerical or categorical: height, ZIP code, gender, academic program, student ID.

    • For a categorical dataset with categories A, B, C, D and frequencies 4, 8, 5, 3, compute N, the proportions, and the percentages.

    • Explain why ZIP code is treated as a non-numerical value for mathematical operations even though it consists of digits.

    • Given a continuous variable with min = 5.6 and max = 9.6, and a target of 8 bins, compute the bin width and write the first three class intervals.

    • Explain when you would use a bar graph versus a histogram for a given dataset.