Chapter 2: Variables and Histograms
Section 1: Variables and Values
Foundational Definitions:
Variable: Any characteristic, number, or quantity that can be measured or counted and can vary from one member of a population to another.
Statistical Importance: Variables are the building blocks of data analysis; understanding their nature determines which statistical tests are applicable.
Example 1: Height - a quantitative variable that encompasses a range of values (e.g., individual A = 64 inches, B = 67.5 inches).
Example 2: Hair Color - a qualitative variable representing distinct categories.
Values: The specific observations recorded for a variable, one for each individual.
Units of Observation: The individuals or objects from which data is collected (e.g., people, cars, cities).
Data Organization (The Data Sheet):
Structure: Data is typically organized in a rectangular grid.
Rows: Represent individual cases or observations. Each row corresponds to one unique entity.
Columns: Represent specific variables. Each column contains the data for one attribute across all individuals.
Table 1a Expanded:
Individual | Height ($X$) | Hair Color ($Y$) | No. of Pets ($W$)
Sam        | 67           | Brown            | 0
Mariana    | 63           | Brown            | 4
Morgan     | 65           | Black            | 2
Ricardo    | 69           | Red              | 1
Variable Notation and Indexing:
Uppercase Letters: Used to denote the variable itself (e.g., $X$ for Height).
Lowercase Letters with Subscripts: Used to denote the specific value for the $i$-th individual.
$x_1 = 67$: The height of the first individual.
$y_2 = \text{"Brown"}$: The hair color of the second individual.
$w_n$: The value of the $n$-th individual on variable $W$.
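The row/column layout and the subscript notation above can be sketched in Python. This is a minimal illustration using the data from Table 1a; the helper names `column` and `value` are hypothetical, chosen only for this example.

```python
# Data sheet from Table 1a: each dict is one row (one individual),
# and each key is one column (one variable).
data_sheet = [
    {"Individual": "Sam", "Height": 67, "Hair Color": "Brown", "No. of Pets": 0},
    {"Individual": "Mariana", "Height": 63, "Hair Color": "Brown", "No. of Pets": 4},
    {"Individual": "Morgan", "Height": 65, "Hair Color": "Black", "No. of Pets": 2},
    {"Individual": "Ricardo", "Height": 69, "Hair Color": "Red", "No. of Pets": 1},
]


def column(rows, variable):
    """Return every value of one variable (one column), in row order."""
    return [row[variable] for row in rows]


def value(rows, variable, i):
    """Return the value for the i-th individual, counting from 1 as in x_i."""
    return rows[i - 1][variable]


X = column(data_sheet, "Height")                 # the variable X as a whole
x_1 = value(data_sheet, "Height", 1)             # height of the first individual
y_2 = value(data_sheet, "Hair Color", 2)         # hair color of the second individual
n = len(data_sheet)                              # number of individuals
w_n = value(data_sheet, "No. of Pets", n)        # pets of the n-th (last) individual

print(X, x_1, y_2, w_n)
```

Note that the notes count individuals from 1, while Python lists count from 0, which is why `value` subtracts 1 from the index.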
Section 2: Types of Variables
Quantitative (Numeric) Variables:
Continuous Variables: Can take on any value within a range. They are measured rather than counted.
Examples: Weight, temperature, or the exact time taken to complete a task. Between any two values (e.g., 1.1 and 1.2), there are infinitely many possible values (e.g., 1.15, 1.155, etc.).
Discrete Variables: Consist of isolated points on a number line, often resulting from counting.
Examples: Number of children in a family, number of cars in a parking lot. There are no values between 2 and 3; you cannot have 2.4 children.
Qualitative (Categorical) Variables:
Nominal Variables: Categories with no intrinsic ordering or ranking.
Examples: Gender, blood type, or zip codes (even though these are numbers, they represent locations, not quantities).
Ordinal Variables: Categories that have a logical order or rank, but the "distance" between categories is not quantifiable.
Examples: Socioeconomic status (Low, Middle, High), or survey responses like "Satisfied, Neutral, Dissatisfied."
Special Classifications:
Dichotomous (Binary) Variables: A variable with exactly two levels.
Types: Categorical labels (e.g., Yes/No, Pass/Fail) or dummy/indicator variables coded numerically (e.g., 0 and 1).
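As a small illustration of a dummy/indicator variable, the sketch below recodes the Hair Color column of Table 1a into a 0/1 variable marking whether each individual has brown hair. The function name `indicator` and the choice of "Brown" as the level of interest are assumptions made for this example.

```python
# Hair Color column (variable Y) from Table 1a, in row order.
hair_color = ["Brown", "Brown", "Black", "Red"]


def indicator(values, level):
    """Recode a categorical variable as 1 where it equals `level`, else 0."""
    return [1 if v == level else 0 for v in values]


is_brown = indicator(hair_color, "Brown")
print(is_brown)       # [1, 1, 0, 0]

# Summing a 0/1 indicator counts the individuals in the category.
print(sum(is_brown))  # 2
```

The resulting variable is dichotomous: it has exactly two levels, 0 and 1.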
Section 3: Frequency Tables and Bar Graphs
Summarizing Categorical Data:
Frequency ($f$): The number of times a particular value occurs in a dataset.
Relative Frequency: The proportion of the total observations that belong to a category.
Formula: $RF = \frac{f}{n}$, where $n$ is the total sample size.
Cumulative Frequency: The sum of frequencies up to a certain point (useful for ordinal data).
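The three summaries above can be computed together. The sketch below is one possible way to build such a table; the survey responses are made-up illustrative data (not from the text), and the function name `frequency_table` is hypothetical. An ordinal variable is used so that the cumulative column is meaningful.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
responses = [
    "Satisfied", "Neutral", "Satisfied", "Dissatisfied",
    "Satisfied", "Neutral", "Satisfied", "Neutral",
]

# The order of an ordinal variable must be stated explicitly.
ordered_levels = ["Dissatisfied", "Neutral", "Satisfied"]


def frequency_table(values, levels):
    """Return one row per level with f, relative frequency f/n, and cumulative f."""
    counts = Counter(values)
    n = len(values)
    running_total = 0
    table = []
    for level in levels:
        f = counts[level]
        running_total += f
        table.append({"level": level, "f": f, "rf": f / n, "cum_f": running_total})
    return table


for row in frequency_table(responses, ordered_levels):
    print(row)
```

With these eight responses, the frequencies are 1, 3, and 4, the relative frequencies are 0.125, 0.375, and 0.5, and the cumulative frequencies are 1, 4, and 8. The last cumulative frequency always equals $n$.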
Visualizing Categorical Data:
Bar Graphs: A graphical representation where the length of each bar is proportional to the frequency or relative frequency.
Gaps between bars: Essential in bar graphs to signify that the categories are discrete and not on a continuous scale.
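To make the "length proportional to frequency" idea concrete, here is a rough text-only bar graph of the Hair Color column from Table 1a. It is a sketch rather than a plotting recipe; the function name `text_bar_graph` and the `#` symbol are arbitrary choices for this example.

```python
from collections import Counter

# Hair Color column (variable Y) from Table 1a, in row order.
hair_color = ["Brown", "Brown", "Black", "Red"]


def text_bar_graph(values, symbol="#"):
    """Return one line per category; bar length equals the category's frequency."""
    counts = Counter(values)  # keeps categories in order of first appearance
    label_width = max(len(category) for category in counts)
    lines = []
    for category, f in counts.items():
        lines.append(f"{category.ljust(label_width)} | {symbol * f} ({f})")
    return lines


# Each category gets its own separate bar, mirroring the gaps in a bar graph.
print("\n".join(text_bar_graph(hair_color)))
```

Because the categories are nominal, the order of the bars carries no meaning; here they simply appear in order of first occurrence.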
Section 4: Frequency Histograms
Definition: A histogram is used for numeric data. It groups values into "bins" or "intervals."
Class Intervals (Bins):
For a frequency histogram, the range of values must be divided into non-overlapping intervals of equal width.
Example: If measuring age in whole years, bins might be 0-9, 10-19, etc.
Rules for Bars:
No Gaps: Bars must touch because the underlying scale is continuous.
Frequency Alignment: The height of the bar represents the frequency of observations within that specific interval.
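The binning step can be sketched as follows. The ages are made-up illustrative data, and the function name `frequency_histogram` and its parameters are assumptions for this example. Each bin is treated as the half-open interval from its left endpoint up to, but not including, the next bin's left endpoint, so whole-year ages 0-9 fall in the first bin, 10-19 in the second, and so on.

```python
def frequency_histogram(values, start, width, num_bins):
    """Count how many values fall in each equal-width bin.

    Bin k covers [start + k*width, start + (k+1)*width).
    Values outside the covered range are ignored in this sketch.
    """
    counts = [0] * num_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts


# Hypothetical ages in whole years (illustrative data only).
ages = [3, 7, 12, 15, 18, 21, 25, 34]

counts = frequency_histogram(ages, start=0, width=10, num_bins=4)
print(counts)  # [2, 3, 2, 1]
```

Each count is the height of the corresponding bar: two observations in 0-9, three in 10-19, two in 20-29, and one in 30-39.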
Section 5: Density Histograms
Purpose: Used when intervals are of unequal width or when comparing datasets of different sizes. The height indicates "density" rather than raw frequency.
The Area Principle: In a density histogram, the area of each bar (not just its height) represents the percentage of the data that falls in that interval.
Calculations:
Interval Width: $W = \text{Right Endpoint} - \text{Left Endpoint}$ (here $W$ denotes the width of an interval, not the variable $W$ from Table 1a)
Percent in Interval: $\text{Percent} = \left( \frac{f}{n} \right) \times 100$
Density: $\text{Density} = \frac{\text{Percent}}{W}$
Total Area: The total area under the density histogram must equal 100%.
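The calculations above can be worked through in a short sketch. The intervals and frequencies are made-up illustrative numbers with deliberately unequal widths, and the function name `density_table` is hypothetical.

```python
def density_table(intervals, frequencies):
    """Compute width, percent, density, and bar area for each interval."""
    n = sum(frequencies)
    rows = []
    for (left, right), f in zip(intervals, frequencies):
        width = right - left          # W = right endpoint - left endpoint
        percent = 100 * f / n         # (f / n) x 100
        density = percent / width     # height of the bar
        rows.append({
            "interval": (left, right),
            "width": width,
            "percent": percent,
            "density": density,
            "area": density * width,  # area of the bar, equal to the percent
        })
    return rows


# Hypothetical intervals of unequal width and their frequencies.
intervals = [(0, 10), (10, 20), (20, 40)]
frequencies = [2, 3, 5]

rows = density_table(intervals, frequencies)
for row in rows:
    print(row)

total_area = sum(row["area"] for row in rows)
print(total_area)  # 100.0
```

In this example the last interval holds the most data (50%) but is twice as wide, so its density (2.5) is lower than that of the middle interval (3.0). This is why bar height alone can mislead when widths differ, and why area is the quantity to read.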
Section 6: Distributions
Distribution Shapes:
Normal (Gaussian) Distribution: Symmetric and bell-shaped; most values cluster around the central mean with symmetrical tails.
Skewness:
Positively Skewed (Right-Skewed): The data has a long tail pointing toward higher positive values. The "hump" is on the left.
Negatively Skewed (Left-Skewed): The data has a long tail pointing toward lower/negative values. The "hump" is on the right.
Uniform Distribution: All values have approximately the same frequency; the histogram looks like a flat rectangle.
Bimodal: The distribution has two distinct peaks, often suggesting the data comes from two different groups within a population.
Identifying Patterns: Distributions help statisticians identify trends, variability (spread), and the presence of outliers (extreme values that lie far from the rest of the data).