Chapter 1 Looking at Data—Distributions

1.1 Data

Cases: The objects described by a dataset are called cases; they are the "who" of the data. Each case typically occupies one row of the dataset. Cases can be:
- Customers
- Companies
- Subjects in a study
- Units in an experiment
- Other objects
Variable: A variable is a characteristic of a case, the "what" of the data. A variable can take different values for different cases. Variables are broadly classified as:
- Categorical Variables: Place a case into one of several groups or categories (e.g., gender, eye color).
- Quantitative Variables: Take numerical values for which arithmetic operations like adding and averaging make sense (e.g., height, income, number of items bought).
Value: Represents the actual numerical measurement or categorical assignment recorded for a variable associated with a specific case.
Label: A special variable used in some datasets to distinguish the cases from one another. Each case has a unique label, so the label can serve as a primary key.
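The vocabulary above can be made concrete with a tiny dataset. This is only an illustrative sketch: the identifiers, eye colors, and heights are invented for the example and do not come from the chapter.

```python
# Hypothetical dataset: each dict is one case (one row),
# and each key is a variable recorded for that case.
cases = [
    {"id": "C001", "eye_color": "brown", "height_cm": 172.0},
    {"id": "C002", "eye_color": "blue", "height_cm": 165.5},
    {"id": "C003", "eye_color": "brown", "height_cm": 180.2},
]

label = "id"                   # label: distinguishes the cases
categorical = ["eye_color"]    # places a case into a group
quantitative = ["height_cm"]   # arithmetic such as averaging makes sense

# A value is what one variable records for one case.
value = cases[1]["height_cm"]

# Averaging is meaningful for a quantitative variable ...
mean_height = sum(c["height_cm"] for c in cases) / len(cases)

# ... while a categorical variable is summarized by its groups.
groups = sorted({c["eye_color"] for c in cases})

print(value, round(mean_height, 2), groups)
```

Averaging eye_color would be meaningless, which is the practical difference between the two kinds of variable.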

1.2 Displaying Distributions with Graphs

Exploratory Data Analysis

Examining data begins with exploring each variable by itself (its distribution) and then moving on to the relationships among variables. It also helps to ask why and how the data were collected.
Begin with graphs to reveal the overall pattern and any striking deviations from it (such as outliers), then add numerical summaries of specific aspects of the data.

Graphical Tools for Data:

For Categorical Variables: Used to display the distribution, showing the count or percent of cases in each category.
- Bar Graphs: Represent categories with bars whose heights reflect counts (frequencies) or percentages (relative frequencies). They are useful for comparing quantities across different categories.
- Pie Charts: Illustrate the distribution as a pie, where slice sizes correspond to counts or percentages of categories. They are best used when emphasizing a category's relation to the whole, but become difficult to read with too many categories.
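Both bar graphs and pie charts are drawn from the same table of counts and percents. The sketch below builds that table with the standard library and prints a rough text bar graph; the eye-color observations are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical observations, one per case.
eye_colors = ["brown", "blue", "brown", "green",
              "brown", "blue", "brown", "blue"]

counts = Counter(eye_colors)                 # frequencies
n = len(eye_colors)
percents = {cat: 100 * k / n for cat, k in counts.items()}  # relative frequencies

# Crude text bar graph: bar length equals the count.
for cat, k in counts.most_common():
    print(f"{cat:<6} {'#' * k} {k} ({percents[cat]:.1f}%)")
```

The percents sum to 100, which is what makes a pie chart appropriate: each slice is a category's share of the whole.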
For Quantitative Variables: Used to display the shape, center, and spread of the distribution, and to identify any outliers.
- Histograms: Utilize bars to exhibit the distribution, where the height of each bar demonstrates the number or proportion of cases within a particular value range (bins). They show the overall shape of the distribution, including symmetry or skewness.
- Stemplots (Stem-and-Leaf Plots): A simple graphical display for small to moderate-sized quantitative datasets that shows the shape of the distribution while retaining the individual data values.
- Time Plots: Plot each observation against the time at which it was measured. Time always goes on the horizontal axis and the variable being measured on the vertical axis. They are used to reveal trends or other patterns over time.
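The bookkeeping behind a histogram and a stemplot can be sketched in a few lines. The scores, the bin width of 10, and the helper names histogram_counts and stemplot are assumptions made for this example; the stemplot helper handles only non-negative integers, with the tens as stems and the units as leaves.

```python
from collections import defaultdict

# Hypothetical quantitative data (for example, test scores).
scores = [52, 58, 61, 64, 65, 67, 70, 71, 73, 73, 75, 78, 81, 84, 96]

def histogram_counts(data, start, width, n_bins):
    """Count observations in equal-width bins [start, start + width), ..."""
    counts = [0] * n_bins
    for x in data:
        i = int((x - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

def stemplot(data):
    """Map each stem (tens digit) to its ordered leaves (units digits)."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)
    # Include every stem between the smallest and largest, even if empty.
    stems = range(min(leaves), max(leaves) + 1)
    return {s: "".join(map(str, leaves[s])) for s in stems}

bins = histogram_counts(scores, start=50, width=10, n_bins=5)
plot = stemplot(scores)

for stem, leaf_string in plot.items():
    print(f"{stem} | {leaf_string}")
```

The histogram keeps only the bin counts, while the stemplot shows the same shape and still lets the reader recover every individual value.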

1.3 Describing Distributions with Numbers

Measuring Center

Mean \bar{x}: The arithmetic average, calculated as the sum of all observations divided by the number of observations (n). It is sensitive to outliers and skewness, pulled in the direction of the tail of a skewed distribution.
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
Median (M): The middle value in an ordered dataset. It is resistant to outliers and skewness, providing a better measure of center for skewed distributions.
- Arrange observations in increasing order.
- If n (number of observations) is odd, the median is the central observation at position (n+1)/2.
- If n is even, the median is the average of the two central observations at positions n/2 and (n/2)+1.
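The two definitions above translate directly into code, and a small experiment shows why the median is called resistant. The income figures are invented for illustration.

```python
def mean(xs):
    """Sum of the observations divided by their number n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered observations."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                    # position (n + 1) / 2, counting from 1
    return (s[mid - 1] + s[mid]) / 2     # positions n / 2 and n / 2 + 1

incomes = [30, 32, 35, 38, 40]           # hypothetical values
with_outlier = incomes + [400]           # add one extreme observation

print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```

Adding the single value 400 moves the mean from 35 to about 95.8, while the median only moves from 35 to 36.5.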

Measuring Spread

Quartiles: Indicate how data is spread around the center by dividing the ordered data into four equal parts.
- First Quartile (Q1): The median of the lower half of the data (25th percentile).
- Third Quartile (Q3): The median of the upper half of the data (75th percentile).
- Interquartile Range (IQR): The distance between the first and third quartiles, IQR = Q3 - Q1. It measures the spread of the middle 50% of the data and is resistant to outliers.
Five-Number Summary: Consists of the smallest observation (Minimum), the first quartile (Q1), the median (M), the third quartile (Q3), and the largest observation (Maximum), written in that order. It gives a quick summary of both center and spread and, because it is built from resistant measures, suits skewed distributions and distributions with outliers.
- Minimum, Q1, M, Q3, Maximum
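A minimal sketch of the quartiles, IQR, and five-number summary follows. It assumes one common convention: when n is odd, the overall median is left out of both halves. Statistical software often uses other quartile rules, so results can differ slightly. The data values are made up.

```python
from statistics import median

def five_number_summary(xs):
    """Return (minimum, Q1, M, Q3, maximum) using medians of the two halves."""
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]
    # Assumption: for odd n the overall median belongs to neither half.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return s[0], median(lower), median(s), median(upper), s[-1]

data = [2, 4, 5, 7, 8, 10, 12, 15, 18]   # hypothetical observations
minimum, q1, m, q3, maximum = five_number_summary(data)
iqr = q3 - q1                             # spread of the middle 50%

print(minimum, q1, m, q3, maximum, iqr)
```

Here the lower half is 2, 4, 5, 7 and the upper half is 10, 12, 15, 18, so Q1 = 4.5, Q3 = 13.5, and IQR = 9.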
Boxplot: Visually displays the five-number summary and outliers. It is constructed by:
- Drawing a number line that covers the entire range of the data.
- Depicting a central box from Q1 to Q3, with the median (M) noted within the box.
- Extending lines (whiskers) from the box to the smallest and largest observations that are not outliers. An observation is typically flagged as a suspected outlier if it falls more than 1.5 \times IQR below Q1 or above Q3; such observations are plotted individually.
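The construction above reduces to a few comparisons once Q1 and Q3 are known. In this sketch the helper name boxplot_parts and the data are hypothetical, and the quartiles are passed in rather than computed so that any quartile convention can be used.

```python
def boxplot_parts(data, q1, q3):
    """Box, whisker ends, and suspected outliers under the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = sorted(x for x in data if x < low_fence or x > high_fence)
    return {
        "box": (q1, q3),
        "fences": (low_fence, high_fence),
        "whiskers": (min(inside), max(inside)),  # extreme non-outliers
        "outliers": outliers,                    # plotted individually
    }

# Hypothetical data; Q1 = 4.5 and Q3 = 13.5 are the medians of the
# lower half (2, 4, 5, 7) and the upper half (10, 12, 15, 40).
data = [2, 4, 5, 7, 8, 10, 12, 15, 40]
parts = boxplot_parts(data, q1=4.5, q3=13.5)
print(parts)
```

With IQR = 9 the fences sit at -9 and 27, so 40 is flagged and the upper whisker stops at 15 rather than at the maximum.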

1.4 Density Curves and Normal Distributions

Density Curves

Definition: A density curve is a smooth curve that models the overall shape of the distribution of a continuous quantitative variable. Areas under the curve represent proportions (probabilities). A density curve:
- Is always on or above the horizontal axis (its values are non-negative).
- Has an area of exactly 1 underneath it, representing the total probability of all possible outcomes (100%).
Mean and Median Comparison:
- The median is the equal-areas point, dividing the area under the curve in half (50% on each side).
- The mean is the balance point of the curve: if the curve were made of solid material, this is where it would balance. For a symmetric density curve, the mean and median coincide. For a skewed curve, the mean is pulled toward the long tail, while the median stays closer to the bulk of the distribution.
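These properties can be checked numerically on a simple right-skewed density. The curve f(x) = 2(1 - x) on the interval from 0 to 1 is chosen here only for illustration; it is not taken from the chapter. The sketch approximates areas with a midpoint Riemann sum.

```python
def density(x):
    """A right-skewed triangular density on [0, 1] (illustrative choice)."""
    return 2 * (1 - x) if 0 <= x <= 1 else 0.0

N = 100_000
dx = 1 / N
xs = [(i + 0.5) * dx for i in range(N)]   # midpoints of N thin strips

area = sum(density(x) * dx for x in xs)       # total area, should be 1
mean = sum(x * density(x) * dx for x in xs)   # balance point

# Median: first point where the accumulated area reaches one half.
cumulative = 0.0
median = None
for x in xs:
    cumulative += density(x) * dx
    if cumulative >= 0.5:
        median = x
        break

print(round(area, 6), round(mean, 4), round(median, 4))
```

The mean comes out near 1/3 and the median near 0.293, so the mean lies farther out toward the long right tail, as described above.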

Normal Distributions

Normal distributions are described by Normal density curves, which are symmetric, single-peaked, and bell-shaped. They are a particularly important class of density curves because many natural phenomena are approximately Normal and many statistical methods rely on them. A Normal distribution is completely specified by its mean (\mu), which fixes the center, and its standard deviation (\sigma), which fixes the spread.
Expressed as N(\mu, \sigma), where \mu is the population mean and \sigma is the population standard deviation.
The 68-95-99.7 rule (also known as the Empirical Rule) states that in a Normal distribution:
- Approximately 68% of observations fall within 1 standard deviation of the mean (\mu \pm \sigma).
- Approximately 95% fall within 2 standard deviations of the mean (\mu \pm 2\sigma).
- Approximately 99.7% are within 3 standard deviations of the mean (\mu \pm 3\sigma).
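The three percentages can be checked against the standard Normal distribution using Python's standard-library statistics.NormalDist. Because the rule is stated in units of \sigma, checking it for N(0, 1) covers every Normal distribution.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {100 * p:.2f}%")
```

The computed areas are about 68.27%, 95.45%, and 99.73%, which the rule rounds to 68, 95, and 99.7.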
Z-score Transformation: The z-score standardizes an observation (x) by expressing how many standard deviations (\sigma) it lies from the mean (\mu). A positive z-score means the value is above the mean, and a negative z-score means it is below. If x follows N(\mu, \sigma), its z-scores follow the standard Normal distribution, which has \mu=0 and \sigma=1.
z = \frac{x - \mu}{\sigma}
Standard Normal Table: A single reference table (Table A) can be used to find areas under any Normal curve. By transforming any observation (x) into its z-score, we can use the standard Normal distribution to find the proportion of observations less than (or greater than) that value.
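A short sketch of why one table suffices: standardizing first and then looking up the standard Normal area gives the same answer as working with N(\mu, \sigma) directly. Here statistics.NormalDist().cdf stands in for a Table A lookup, and the values \mu = 64.5, \sigma = 2.5, and x = 68 are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

mu, sigma = 64.5, 2.5   # hypothetical N(mu, sigma)
x = 68.0

z = z_score(x, mu, sigma)
below = NormalDist().cdf(z)              # plays the role of Table A
above = 1 - below                        # area to the right of x
direct = NormalDist(mu, sigma).cdf(x)    # same area without standardizing

print(f"z = {z:.2f}, below = {below:.4f}, above = {above:.4f}")
```

For these values z = 1.4, and the area to the left is about 0.9192 whichever route is taken.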

Normal Calculations

Steps to address problems with Normal Distributions:
1. State the Problem: Define the problem in terms of the variable x, clearly identifying what proportion or value is being sought.
2. Standardize and Draw: Sketch the distribution and shade the area of interest, then convert the x value(s) to z-score(s) using z = \frac{x - \mu}{\sigma}.
3. Use the Table: Use the Standard Normal Table (Table A) to find the area corresponding to the z-score(s).
4. Conclude: State the answer in the context of the original variable x, ensuring it addresses the initial problem.
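The four steps can be followed for an interval question. This is a sketch under assumed numbers: x is taken to follow N(64.5, 2.5) and the interval 60 to 68 is invented for the example. The helper name proportion_between is hypothetical, and NormalDist().cdf again stands in for Table A.

```python
from statistics import NormalDist

def proportion_between(low, high, mu, sigma):
    """Proportion of a N(mu, sigma) variable falling between low and high."""
    # Step 2: standardize both endpoints.
    z_low = (low - mu) / sigma
    z_high = (high - mu) / sigma
    # Step 3: look up the area to the left of each z and subtract.
    std_normal = NormalDist()
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Step 1: state the problem in terms of x.
# What proportion of x values lie between 60 and 68?
p = proportion_between(60, 68, mu=64.5, sigma=2.5)

# Step 4: conclude in the context of the original variable.
print(f"About {100 * p:.1f}% of observations lie between 60 and 68.")
```

The endpoints standardize to z = -1.8 and z = 1.4, and the difference of the two left-hand areas is roughly 0.883.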

Normal Quantile Plots

Normal quantile plots are a diagnostic tool for assessing how closely a dataset follows a Normal distribution. If the points lie close to a straight line, the data are approximately Normal. Systematic departures from a straight line suggest non-Normality (e.g., skewness or heavy or light tails). Individual points that lie far from the overall pattern of the plot are potential outliers.
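The coordinates of a Normal quantile plot can be computed without any plotting library. This sketch assumes the plotting positions (i - 0.5)/n, which is one common choice; software packages differ here. It uses the correlation of the plotted points as a rough, informal measure of straightness, which is an addition for this example, and both datasets are made up.

```python
from statistics import NormalDist, mean

def normal_quantile_points(data):
    """Pair each ordered observation with a standard Normal quantile."""
    s = sorted(data)
    n = len(s)
    std_normal = NormalDist()
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(s)]

def straightness(points):
    """Correlation between the z quantiles and the data values."""
    zs, xs = zip(*points)
    mz, mx = mean(zs), mean(xs)
    sxy = sum((z - mz) * (x - mx) for z, x in points)
    szz = sum((z - mz) ** 2 for z in zs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (szz * sxx) ** 0.5

roughly_normal = [61, 63, 64, 64, 65, 65, 66, 67, 68, 70]   # hypothetical
right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]             # hypothetical

r_normal = straightness(normal_quantile_points(roughly_normal))
r_skewed = straightness(normal_quantile_points(right_skewed))
print(round(r_normal, 3), round(r_skewed, 3))
```

The roughly symmetric sample gives points much closer to a straight line than the right-skewed sample, whose largest value bends the plot upward.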

Density Estimation

Modern statistical software can compute density estimators (e.g., kernel density estimators) that do not assume any specific distribution shape. These estimators offer a flexible, data-driven way to summarize and visualize complex distributions without relying on parametric models such as the Normal distributions.
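As a rough illustration of the idea, the sketch below builds a kernel density estimate by hand: it places a small Normal bump on each observation and averages them. The Gaussian kernel, the bandwidth of 0.5, the helper name kde_gaussian, and the data are all assumptions made for this example; statistical software chooses the bandwidth automatically and may use other kernels.

```python
from math import exp, pi, sqrt

def kde_gaussian(data, bandwidth):
    """Return a function estimating the density as an average of Normal bumps."""
    n = len(data)
    def estimate(x):
        total = sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / (n * bandwidth * sqrt(2 * pi))
    return estimate

# Hypothetical data with two clusters, which a single Normal curve fits poorly.
data = [1.0, 1.2, 1.9, 2.1, 2.2, 5.8, 6.0, 6.3]
f = kde_gaussian(data, bandwidth=0.5)

# Like any density curve, the estimate should have total area close to 1.
dx = 0.01
area = sum(f(-5 + i * dx) * dx for i in range(1700))   # grid from -5 to 12

print(round(area, 3), round(f(2.0), 3), round(f(4.0), 3))
```

The estimate is high near 2 and near 6 and low near 4, so it reveals the two clusters instead of forcing a single bell shape onto the data.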

Conclusion of Chapter

Understanding and effectively utilizing distributions is paramount to analyzing data successfully. The chapter provided various techniques and measures essential for exploring, displaying, and interpreting both categorical and quantitative distributions, equipping students with the foundational knowledge required for deeper statistical practices. Mastery of these concepts forms the basis for infer