Data Visualization & Central Tendency

Creation: To create a histogram, select a dataset (e.g., CSUF), choose a variable (e.g., fast NPH), set the bin color, the type (frequency), and the value label (counts), and add a density curve.
Bin Adjustment: Change the number of bins in the 'Details' section.
Shape Identification: Observe the overall shape (e.g., bell-shaped, approximate normal).
Group Separation: Use 'Plot by group' for categorical variables (e.g., 'sex') to create separate histograms.
Label Customization: Use the 'Level editor' to modify labels for grouped plots.
Relative Frequency: Choose 'relative frequency' as the type to display the percentage of observations in each bin instead of raw counts.
Multiple Histograms: When separating by group, adjust 'rows' and 'columns' to control the layout. These settings are not available for the 'overlay' option.
Uniform Limits: Use the same x and y limits on every grouped plot so the histograms can be compared directly.
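
The binning behind these options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equal-width bins, made-up data values, and hypothetical helper names (`histogram_counts`, `relative_frequencies`); the software's own binning rule may differ.

```python
def histogram_counts(values, num_bins, lo=None, hi=None):
    """Count the values falling in each of num_bins equal-width bins.

    lo and hi are optional shared axis limits (assumed to cover the data);
    by default the limits are the data's own minimum and maximum.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi <= lo:
        raise ValueError("need hi > lo to form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # A value equal to hi is placed in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw bin counts to proportions of the total."""
    total = sum(counts)
    return [c / total for c in counts]


# Made-up measurements, split by a hypothetical grouping variable.
by_group = {"F": [70, 80, 82, 90], "M": [75, 85, 88, 91, 95, 110]}
everything = by_group["F"] + by_group["M"]

counts = histogram_counts(everything, 4)   # [2, 4, 3, 1]
shares = relative_frequencies(counts)      # [0.2, 0.4, 0.3, 0.1]

# Uniform limits: every group uses the same lo and hi, so bins line up.
lo, hi = min(everything), max(everything)
grouped = {g: histogram_counts(v, 4, lo, hi) for g, v in by_group.items()}
```

Changing `num_bins` plays the role of the bin adjustment, and passing the same `lo` and `hi` to every group mirrors the uniform-limits setting.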

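Definition: A stem-and-leaf plot is a display for quantitative variables that shows the actual numerical values, grouped into ranges by their leading digits.
Advantages: Shows each individual data value while still revealing the shape of the distribution.
Disadvantages: Not suitable for very large datasets; the display should ideally fit on one page.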
Construction:
- Stem: Consists of the leftmost digits of a number.
- Leaf: Consists of the rightmost digit(s) of a number.
- Ordering: Data must be ordered from lowest to highest before plotting.
- Leaf Digit Unit (LDU): Specify the place value of the leaf (e.g., $1$ for ones, $0.1$ for tenths) so that the displayed numbers can be interpreted correctly.
- Gaps: Stems with no corresponding data values (leaves) should still be included to show gaps in the data distribution.
- Scaling: 'Scale 2' splits each stem into two rows (e.g., $6, 6, 7, 7$) when a single stem has too many leaves, which improves readability.
- Orientation: Can be changed (e.g., to 'upward') to better visualize the distribution shape.
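
The construction rules above can be sketched as a small Python function. It is a minimal illustration, not the software's algorithm: the name `stem_and_leaf` is hypothetical, values are assumed non-negative, leaves are a single digit, and values are rounded (not truncated) to the leaf unit.

```python
def stem_and_leaf(values, leaf_unit=1):
    """Return the rows of a stem-and-leaf display with one-digit leaves."""
    # Express every value in leaf units and order from lowest to highest.
    scaled = sorted(round(v / leaf_unit) for v in values)
    stems = {}
    for s in scaled:
        # Leaf = last digit; stem = all digits to its left.
        stems.setdefault(s // 10, []).append(s % 10)
    rows = []
    # Walk every stem in the range so that empty stems show gaps.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


# Made-up scores; the stem 8 has no leaves but is still listed.
for row in stem_and_leaf([62, 65, 67, 71, 78, 95]):
    print(row)
# 6 | 257
# 7 | 18
# 8 |
# 9 | 5
```

With `leaf_unit=0.1`, the row `1 | 25` would stand for the values 1.2 and 1.5, which is why the leaf unit has to be stated alongside the plot.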

Mean: The arithmetic average of all data points.
- Population Mean: Denoted by $\mu$ .
- Sample Mean: Denoted by $\bar{x} = \frac{\sum x_i}{n}$, where $\sum x_i$ is the sum of all values and $n$ is the sample size.
- Influence: Highly affected by extreme values (outliers); it is 'dragged' towards them.
Median: The middle value of an ordered dataset.
- Calculation: With an odd number of values, it is the single middle value. With an even number of values, it is the average of the two middle values.
- Robustness: Resistant to extreme values and outliers; changing the size of an extreme value does not move the median.
- Comparison: If mean $\approx$ median, the distribution is roughly symmetric. A large difference suggests skewness or outliers.
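
A short Python sketch shows how one extreme value affects the two measures. The data are made up for illustration, and `sample_mean` is a hypothetical helper that simply applies the formula for $\bar{x}$; `median` comes from Python's standard `statistics` module.

```python
from statistics import median


def sample_mean(values):
    """Sum of all values divided by the sample size n."""
    return sum(values) / len(values)


data = [30, 32, 35, 38, 40]      # made-up, roughly symmetric values
with_outlier = data + [200]      # same values plus one extreme value

print(sample_mean(data), median(data))                  # 35.0 35
print(sample_mean(with_outlier), median(with_outlier))  # 62.5 36.5
```

The single value 200 drags the mean from 35 to 62.5, while the median moves only from 35 to 36.5. The second dataset has an even number of values, so its median is the average of the two middle values, 35 and 38.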

Definition: The five-number summary is a set of five key values that summarize the distribution of a dataset.
Components:
- Minimum: The smallest value in the dataset.
- First Quartile ($Q_1$ or Lower Quartile, $Q_L$): The median of the lower half of the ordered dataset (excluding the overall median for odd-sized data).
- Median ( $M$ or $Q_2$ ): The middle value of the ordered dataset.
- Third Quartile ($Q_3$ or Upper Quartile, $Q_U$): The median of the upper half of the ordered dataset (excluding the overall median for odd-sized data).
- Maximum: The largest value in the dataset.
Interpretation: These five numbers divide the ordered data into four quarters, each containing about $25\%$ of the data values.
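
The quartile rule described above (median of each half, leaving out the overall median when the sample size is odd) can be sketched in Python as follows. The function name `five_number_summary` is hypothetical, the data are made up, and statistical software may use a different quartile rule and so report slightly different values.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least 2 values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))    # (1, 3, 7, 11, 13)
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))   # (1, 2.5, 4.5, 6.5, 8)
```

In the first call the overall median 7 is excluded from both halves, so $Q_1$ is the median of 1, 3, 5 and $Q_3$ is the median of 9, 11, 13.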

Definition: A box plot is a graphical display of the five-number summary and any outliers.
Display of Five-Number Summary:
- Bottom Whisker: Extends down to the smallest value that is not an outlier.
- Bottom of the Box: Represents the first quartile ( $Q_1$ ).
- Line inside the Box: Represents the median.
- Top of the Box: Represents the third quartile ( $Q_3$ ).
- Top Whisker: Extends up to the largest value that is not an outlier.
- Outliers: Shown as individual points (e.g., circles) outside the whiskers. Specific values can be displayed in software.
Mean: Can be optionally displayed as a diamond shape on the box plot for comparison with the median.
Group Comparison: Multiple box plots can be generated side-by-side (e.g., by a categorical variable like 'sex') to compare distributions and identify differences in center, spread, and outliers between groups.
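
The values a box plot draws can be computed with a short Python sketch. The notes do not say how outliers are identified, so this example assumes the common $1.5 \times IQR$ rule, where $IQR = Q_3 - Q_1$; the function name `box_plot_stats`, the parameter `k`, and the data are all hypothetical, and quartiles are taken as medians of the lower and upper halves.

```python
from statistics import median


def box_plot_stats(values, k=1.5):
    """Compute the box, whisker ends, and outliers for one box plot."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + 1:] if n % 2 else data[half:])
    iqr = q3 - q1
    # Assumed rule: points beyond k * IQR from the box are outliers.
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],    # smallest non-outlier
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "whisker_high": inside[-1],  # largest non-outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = box_plot_stats([2, 4, 5, 6, 7, 8, 30])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
# 2 8 [30]
```

Here the top whisker stops at 8 and the value 30 is drawn as a separate point. Calling the function once per group (for example, once per level of 'sex') gives the numbers behind side-by-side box plots.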