Chapter 2: Variables and Histograms

Section 1: Variables and Values
  • Foundational Definitions:

    • Variable: Any characteristic, number, or quantity that can be measured or counted and can vary from one member of a population to another.

    • Statistical Importance: Variables are the building blocks of data analysis; understanding their nature determines which statistical tests are applicable.

    • Example 1: Height - a quantitative variable that encompasses a range of values (e.g., individual A = 64 inches, B = 67.5 inches).

    • Example 2: Hair Color - a qualitative variable representing distinct categories.

    • Value: The specific observation recorded for a variable on a particular individual or object.

    • Units of Observation: The individuals or objects from which data is collected (e.g., people, cars, cities).

  • Data Organization (The Data Sheet):

    • Structure: Data is typically organized in a rectangular grid.

    • Rows: Represent individual cases or observations. Each row corresponds to one unique entity.

    • Columns: Represent specific variables. Each column contains the data for one attribute across all individuals.

    • Table 1a Expanded:

    Individual | Height in inches ($X$) | Hair Color ($Y$) | No. of Pets ($W$)
    Sam        | 67                     | Brown            | 0
    Mariana    | 63                     | Brown            | 4
    Morgan     | 65                     | Black            | 2
    Ricardo    | 69                     | Red              | 1

  • Variable Notation and Indexing:

    • Uppercase Letters: Used to denote the variable itself (e.g., $X$ for Height).

    • Lowercase Letters with Subscripts: Used to denote the value of the variable for the $i$-th individual (e.g., $x_i$).

    • $x_1 = 67$: The height of the first individual.

    • $y_2 = \text{"Brown"}$: The hair color of the second individual.

    • $w_n$: The value of variable $W$ for the $n$-th individual; when $n$ is the sample size, this is the last individual.
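The data sheet and the subscript notation above can be sketched in Python. This is a minimal illustration, not a prescribed method: the key names (`height`, `hair_color`, `pets`) and the helper `value` are hypothetical choices; the data come from Table 1a.

```python
# Table 1a as a data sheet: each row is one individual, each key is one variable.
data_sheet = [
    {"individual": "Sam", "height": 67, "hair_color": "Brown", "pets": 0},
    {"individual": "Mariana", "height": 63, "hair_color": "Brown", "pets": 4},
    {"individual": "Morgan", "height": 65, "hair_color": "Black", "pets": 2},
    {"individual": "Ricardo", "height": 69, "hair_color": "Red", "pets": 1},
]


def value(sheet, variable, i):
    """Return the value of `variable` for the i-th individual (1-based, like x_i)."""
    return sheet[i - 1][variable]


n = len(data_sheet)                        # sample size
x_1 = value(data_sheet, "height", 1)       # height of the first individual
y_2 = value(data_sheet, "hair_color", 2)   # hair color of the second individual
w_n = value(data_sheet, "pets", n)         # number of pets of the n-th (last) individual

print(x_1, y_2, w_n)  # 67 Brown 1
```

Note that the notation counts individuals from 1 while Python lists count from 0, which is why the helper subtracts 1.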

Section 2: Types of Variables
  • Quantitative (Numeric) Variables:

    1. Continuous Variables: Can take on any value within a range. They are measured rather than counted.

    • Examples: Weight, temperature, or the exact time taken to complete a task. Between any two values (e.g., 1.1 and 1.2), there are infinitely many possible values (e.g., 1.15, 1.155, etc.).

    2. Discrete Variables: Consist of isolated points on a number line, often resulting from counting.

    • Examples: Number of children in a family, number of cars in a parking lot. There are no values between 2 and 3; you cannot have 2.4 children.

  • Qualitative (Categorical) Variables:

    1. Nominal Variables: Categories with no intrinsic ordering or ranking.

    • Examples: Gender, blood type, or zip codes (even though these are numbers, they represent locations, not quantities).

    2. Ordinal Variables: Categories that have a logical order or rank, but the "distance" between categories is not quantifiable.

    • Examples: Socioeconomic status (Low, Middle, High), or survey responses like "Satisfied, Neutral, Dissatisfied."

  • Special Classifications:

    • Dichotomous (Binary) Variables: A variable with exactly two levels.

    • Forms: The two levels may be recorded as category labels (e.g., Yes/No, Pass/Fail) or as a dummy/indicator variable coded 0 and 1.
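As a small sketch of the two ways to record a dichotomous variable, the code below converts Pass/Fail labels into a 0/1 indicator. The exam results are made up for illustration, and the helper name `to_indicator` and the choice of coding "Pass" as 1 are assumptions, not part of the chapter.

```python
# Hypothetical exam results (invented for illustration).
results = ["Pass", "Fail", "Pass", "Pass"]


def to_indicator(values, success_label):
    """Map a two-level categorical variable to 0/1 indicator values."""
    levels = set(values)
    if len(levels) > 2:
        raise ValueError("a dichotomous variable has at most two levels")
    return [1 if v == success_label else 0 for v in values]


passed = to_indicator(results, "Pass")
print(passed)                      # [1, 0, 1, 1]
print(sum(passed) / len(passed))   # 0.75, the proportion coded 1
```

One practical reason for the 0/1 coding is visible in the last line: the average of an indicator variable is the proportion of individuals in the category coded 1.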

Section 3: Frequency Tables and Bar Graphs
  • Summarizing Categorical Data:

    • Frequency ($f$): The number of times a particular value occurs in a dataset.

    • Relative Frequency: The proportion of the total observations that belong to a category.

    • Formula: $RF = \frac{f}{n}$, where $n$ is the total sample size.

    • Cumulative Frequency: The sum of the frequencies up to and including a given category (useful for ordinal data).

  • Visualizing Categorical Data:

    • Bar Graphs: A graphical representation where the length of each bar is proportional to the frequency or relative frequency.

    • Gaps between bars: Essential in bar graphs to signify that the categories are discrete and not on a continuous scale.
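The frequency summaries in this section can be sketched for the Hair Color column of Table 1a. The function name `frequency_table` is a hypothetical choice, and the categories are listed alphabetically only for convenience. Cumulative frequency is computed to show the arithmetic, although for a nominal variable such as hair color it has no natural interpretation.

```python
from collections import Counter

hair_color = ["Brown", "Brown", "Black", "Red"]  # column Y of Table 1a


def frequency_table(values):
    """Return rows of (category, f, relative frequency, cumulative frequency)."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for category in sorted(counts):  # alphabetical order, an arbitrary choice
        f = counts[category]
        cumulative += f
        rows.append((category, f, f / n, cumulative))
    return rows


table = frequency_table(hair_color)
for category, f, rf, cf in table:
    # Text bar whose length is proportional to the frequency.
    print(f"{category:<6} f={f} RF={rf:.2f} CF={cf} {'#' * f}")
```

The relative frequencies sum to 1, and the last cumulative frequency equals the sample size $n$.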

Section 4: Frequency Histograms
  • Definition: A histogram is used for numeric data. It groups values into "bins" or "intervals."

  • Class Intervals (Bins):

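    • The range of values must be divided into non-overlapping intervals of equal width that together cover every observation.

    • Example: If measuring age in whole years, bins might be 0-9, 10-19, etc. For continuous measurements, write the bins as 0 to under 10, 10 to under 20, etc., so that every value falls in exactly one bin.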

  • Rules for Bars:

    1. No Gaps: Bars must touch because the underlying scale is continuous.

    2. Frequency Alignment: The height of the bar represents the frequency of observations within that specific interval.
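The binning rules above can be sketched as follows. The ages are hypothetical values invented for this illustration, and the function name `frequency_histogram` and its parameters are assumptions. Each bin includes its left endpoint and excludes its right endpoint, so adjacent bins touch without overlapping.

```python
# Hypothetical ages (invented for illustration).
ages = [3, 8, 12, 15, 19, 21, 24, 27, 35]


def frequency_histogram(values, start, width, num_bins):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * num_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin containing v
        if not 0 <= k < num_bins:
            raise ValueError(f"value {v} falls outside the bins")
        counts[k] += 1
    return counts


counts = frequency_histogram(ages, start=0, width=10, num_bins=4)
for k, f in enumerate(counts):
    left = k * 10
    # Bar length stands in for bar height: one '#' per observation.
    print(f"[{left:>2}, {left + 10:>2})  {'#' * f}  ({f})")
```

With these values the four bins hold 2, 3, 3, and 1 observations, which together account for all 9 ages.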

Section 5: Density Histograms
  • Purpose: Used when intervals are of unequal width or when comparing datasets of different sizes. The height indicates "density" rather than raw frequency.

  • The Area Principle: In a density histogram, the area of the bar (not just the height) represents the percentage of the data.

  • Calculations:

    1. Interval Width: $W = \text{Right Endpoint} - \text{Left Endpoint}$

    2. Percent in Interval: $\text{Percent} = \left( \frac{f}{n} \right) \times 100$

    3. Density: $\text{Density} = \frac{\text{Percent}}{W}$

    • Total Area: The total area of the bars in a density histogram must equal $100\%$.
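The three calculations above can be checked with a short sketch. The bins and frequencies below are hypothetical numbers chosen to have unequal widths, and the function name `density_table` is an assumption.

```python
# Hypothetical bins of unequal width: (left endpoint, right endpoint, frequency f).
bins = [(0, 10, 5), (10, 20, 10), (20, 50, 5)]


def density_table(bins):
    """Return rows of (left, right, width, percent, density) for each bin."""
    n = sum(f for _, _, f in bins)      # total sample size
    rows = []
    for left, right, f in bins:
        width = right - left            # W = right endpoint - left endpoint
        percent = f / n * 100           # percent of observations in the interval
        density = percent / width       # percent per unit of width (bar height)
        rows.append((left, right, width, percent, density))
    return rows


rows = density_table(bins)
for left, right, width, percent, density in rows:
    print(f"[{left}, {right})  W={width}  percent={percent:.1f}  density={density:.3f}")

# Area principle: width * density recovers each bin's percent; the areas sum to 100.
total_area = sum(width * density for _, _, width, _, density in rows)
print(round(total_area, 6))
```

The first and last bins each hold 25% of the data, but the last bin is three times as wide, so its bar is only one third as tall. That is the effect a raw frequency histogram would hide.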

Section 6: Distributions
  • Distribution Shapes:

    1. Normal (Gaussian) Distribution: Symmetric and bell-shaped; most values cluster around the mean, and the tails taper off equally on both sides.

    2. Skewness:

    • Positively Skewed (Right-Skewed): The data has a long tail pointing toward higher positive values. The "hump" is on the left.

    • Negatively Skewed (Left-Skewed): The data has a long tail pointing toward lower/negative values. The "hump" is on the right.

    3. Uniform Distribution: All values have approximately the same frequency; the histogram looks like a flat rectangle.

    4. Bimodal: The distribution has two distinct peaks, often suggesting the data comes from two different groups within a population.

  • Identifying Patterns: Distributions help statisticians identify trends, variability (spread), and the presence of outliers (extreme values that differ markedly from the rest of the data).