Data Visualization & Central Tendency
Histogram Basics Review
Creation: Select dataset (e.g., CSUF), choose variable (e.g., fast NPH), set bin color, type (frequency), value label (counts), and add a density curve.
Bin Adjustment: Change the number of bins in the 'Details' section.
Shape Identification: Observe the overall shape (e.g., bell-shaped, approximate normal).
Group Separation: Use 'Plot by group' for categorical variables (e.g., 'sex') to create separate histograms.
Label Customization: Use the 'Level editor' to modify labels for grouped plots.
Relative Frequency: Choose 'relative frequency' to display percentages instead of raw counts.
Multiple Histograms: When separating by group, adjust 'rows' and 'columns' to control the layout. These options are not available for 'overlay'.
Uniform Limits: Use uniform x and y limits for better comparison between grouped plots.
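To make the difference between the frequency and relative frequency options concrete, here is a minimal sketch that bins a small sample by hand. The data values, the bin width, and the helper name `bin_counts` are assumptions made for illustration; they are not taken from the CSUF dataset or from the course software.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width); the last bin
    also includes its right edge.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if i == n_bins and v == start + n_bins * width:
            i = n_bins - 1  # a value on the far right edge goes in the last bin
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical sample (illustration only)
data = [62, 65, 67, 68, 70, 71, 71, 73, 74, 78]
counts = bin_counts(data, start=60, width=5, n_bins=4)

# Relative frequency = count in the bin / total number of values
rel_freq = [c / len(data) for c in counts]

for i, (c, r) in enumerate(zip(counts, rel_freq)):
    lo = 60 + 5 * i
    print(f"{lo}-{lo + 5}: count={c}, relative frequency={r:.0%}")
```

The bar heights change scale between the two options, but the shape of the histogram stays the same, because every count is divided by the same total.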
Stem-and-Leaf Plot
Definition: A display for quantitative variables that shows actual numerical values within data ranges.
Advantages: Shows individual data points.
Disadvantages: Not suitable for very large datasets; the display should ideally fit on one page.
Construction:
Stem: Consists of the leftmost digits of a number.
Leaf: Consists of the rightmost digit(s) of a number.
Ordering: Data must be ordered from lowest to highest before plotting.
Leaf Digit Unit (LDU): Specifies the place value of the leaf (1 for ones, 0.1 for tenths). It must be stated so that the displayed numbers can be interpreted correctly.
Gaps: Stems with no corresponding data values (leaves) should still be included to show gaps in the data distribution.
Scaling: 'Scale 2' can be used to split each stem into two rows (e.g., leaves 0-4 on one row and 5-9 on the next) when a single stem has too many leaves, improving readability.
Orientation: Can be changed (e.g., to 'upward') to better visualize the distribution shape.
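The construction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf`, its `leaf_unit` parameter, and the sample values are hypothetical, and the sketch handles non-negative values and a single row per stem only.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return stem-and-leaf rows as (stem, leaves) pairs.

    leaf_unit is the place value of the leaf digit (1 = ones, 0.1 = tenths).
    Assumes non-negative values.
    """
    # Express every value in whole leaf units, then order low to high.
    scaled = sorted(round(v / leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        # Stem = leftmost digits, leaf = rightmost digit.
        leaves[s // 10].append(s % 10)
    # Include every stem between the smallest and largest so gaps stay visible.
    rows = []
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        rows.append((stem, "".join(str(d) for d in leaves.get(stem, []))))
    return rows

data = [23, 25, 31, 31, 38, 52, 57, 60]  # hypothetical values
rows = stem_and_leaf(data, leaf_unit=1)
for stem, leaf_str in rows:
    print(f"{stem} | {leaf_str}")
```

With a leaf unit of 1, the row `3 | 118` reads as 31, 31, 38. The stem 4 is printed with no leaves, which shows the gap between the 30s and the 50s.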
Measures of Center
Mean: The arithmetic average of all data points.
Population Mean: Denoted by $\mu$.
Sample Mean: Denoted by $\bar{x} = \frac{\sum x}{n}$, where $\sum x$ is the sum of all values and $n$ is the sample size.
Influence: Highly affected by extreme values (outliers); it is 'dragged' towards them.
Median: The middle value of an ordered dataset.
Calculation: If an odd number of values, it's the single middle value. If an even number of values, it's the average of the two middle values.
Robustness: Largely unaffected by extreme values or outliers, because it depends only on the middle of the ordered data.
Comparison: If mean ≈ median, the data is relatively symmetrical. A significant difference suggests skewness or outliers.
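A small numerical check shows how differently the two measures react to one extreme value. The numbers below are made up for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

data = [4, 5, 6, 7, 8]        # hypothetical, symmetric sample
with_outlier = data + [60]    # same sample plus one extreme value

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

Without the extreme value, the mean and median are both 6. Adding the value 60 moves the mean to 15, while the median only shifts to 6.5 (the average of the two middle values, 6 and 7).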
Five-Number Summary
Definition: A set of five key values that summarize the distribution of a dataset.
Components:
Minimum: The smallest value in the dataset.
First Quartile ($Q_1$, or Lower Quartile): The median of the lower half of the ordered dataset (excluding the overall median for odd-sized data).
Median ($Q_2$): The middle value of the ordered dataset.
Third Quartile ($Q_3$, or Upper Quartile): The median of the upper half of the ordered dataset (excluding the overall median for odd-sized data).
Maximum: The largest value in the dataset.
Interpretation: These five numbers divide the data into four quarters, with each section representing 25% of the data values.
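The quartile rule described above (median of each half, leaving the overall median out of both halves when the sample size is odd) can be sketched as follows. The function name `five_number_summary` and the sample data are assumptions for illustration. Statistical software may use a different quartile convention and return slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    When the number of values is odd, the overall median is excluded
    from both the lower and the upper half.
    """
    x = sorted(values)
    n = len(x)
    half = n // 2
    lower = x[:half]
    upper = x[half + 1:] if n % 2 else x[half:]
    return x[0], median(lower), median(x), median(upper), x[-1]

odd_sample = [1, 3, 5, 7, 9, 11, 13]   # hypothetical, n = 7
even_sample = [2, 4, 6, 8, 10, 12]     # hypothetical, n = 6

summary_odd = five_number_summary(odd_sample)
summary_even = five_number_summary(even_sample)
print(summary_odd)
print(summary_even)
```

For the odd-sized sample, the median 7 is dropped before the halves [1, 3, 5] and [9, 11, 13] are formed, so the quartiles are 3 and 11. For the even-sized sample, the halves are [2, 4, 6] and [8, 10, 12], giving quartiles 4 and 10 and a median of 7.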
Box Plot (Box and Whisker Plot)
Definition: A graphical display of the five-number summary and outliers.
Display of Five-Number Summary:
Bottom Whisker: Represents the minimum value (excluding outliers).
Bottom of the Box: Represents the first quartile ($Q_1$).
Line inside the Box: Represents the median.
Top of the Box: Represents the third quartile ($Q_3$).
Top Whisker: Represents the maximum value (excluding outliers).
Outliers: Shown as individual points (e.g., circles) outside the whiskers. Specific values can be displayed in software.
Mean: Can be optionally displayed as a diamond shape on the box plot for comparison with the median.
Group Comparison: Multiple box plots can be generated side-by-side (e.g., by a categorical variable like 'sex') to compare distributions and identify differences in center, spread, and outliers between groups.
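The notes do not state how a point is classified as an outlier. A common convention, assumed here, flags any value more than 1.5 times the interquartile range (IQR = $Q_3 - Q_1$) below $Q_1$ or above $Q_3$. The sketch below uses that assumed rule to find where the whiskers end; the function name `box_plot_stats`, the parameter `k`, and the data are hypothetical, and the software used in class may apply a different rule.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Compute the pieces of a box plot.

    Assumption: outliers are values beyond k * IQR from the quartiles
    (k = 1.5 is a common convention, not something stated in the notes).
    Quartiles use the median-of-halves rule.
    """
    x = sorted(values)
    n = len(x)
    half = n // 2
    q1 = median(x[:half])
    q3 = median(x[half + 1:] if n % 2 else x[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in x if low_fence <= v <= high_fence]
    outliers = [v for v in x if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median(x),
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,
    }

data = [10, 12, 13, 14, 15, 16, 18, 40]  # hypothetical values
stats = box_plot_stats(data)
print(stats)
```

Here $Q_1 = 12.5$ and $Q_3 = 17$, so the IQR is 4.5 and the upper fence is 23.75. The value 40 lies beyond it and is drawn as a separate point, while the top whisker stops at 18.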