Data Visualization & Central Tendency

Histogram Basics Review
  • Creation: Select dataset (e.g., CSUF), choose variable (e.g., fast NPH), set bin color, type (frequency), value label (counts), and add a density curve.

  • Bin Adjustment: Change the number of bins in the 'Details' section.

  • Shape Identification: Observe the overall shape (e.g., bell-shaped, approximate normal).

  • Group Separation: Use 'Plot by group' for categorical variables (e.g., 'sex') to create separate histograms.

  • Label Customization: Use the 'Level editor' to modify labels for grouped plots.

  • Relative Frequency: Choose 'relative frequency' to display proportions (percentages) instead of raw counts.

  • Multiple Histograms: When separating by group, adjust 'rows' and 'columns' to control the layout; these settings are not available for 'overlay'.

  • Uniform Limits: Use uniform x and y limits for better comparison between grouped plots.
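
A minimal sketch of what the frequency and relative-frequency options compute, assuming equal-width bins and data that are not all identical. The helper name bin_counts and the sample values are hypothetical; they do not come from the software or the dataset named above.

```python
def bin_counts(values, num_bins):
    """Count values in equal-width bins spanning min..max (hypothetical helper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


data = [62, 64, 65, 65, 66, 67, 67, 68, 70, 72]  # made-up illustration data
counts = bin_counts(data, 5)                      # frequency per bin
relative = [c / len(data) for c in counts]        # proportions that sum to 1
```

Changing num_bins plays the same role as changing the number of bins in the 'Details' section: the data stay the same, only the grouping changes.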

Stem-and-Leaf Plot

  • Definition: A display for quantitative variables that shows actual numerical values within data ranges.

  • Advantages: Shows individual data points.

  • Disadvantages: Not suitable for very large datasets; the plot should ideally fit on one page.

  • Construction:

    • Stem: Consists of the leftmost digits of a number.

    • Leaf: Consists of the rightmost digit(s) of a number.

    • Ordering: Data must be ordered from lowest to highest before plotting.

    • Leaf Digit Unit (LDU): Essential to specify the place value of the leaf (e.g., 1 for ones, 0.1 for tenths) to correctly interpret the displayed numbers.

    • Gaps: Stems with no corresponding data values (leaves) should still be included to show gaps in the data distribution.

    • Scaling: 'Scale 2' can be used to split stems (e.g., 6, 6, 7, 7) when a single stem has too many leaves, improving readability.

    • Orientation: Can be changed (e.g., to 'upward') to better visualize the distribution shape.
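
The construction steps above can be sketched as follows. This is an illustration only: the helper stem_and_leaf and its leaf_unit parameter are hypothetical names, the data are made up, and the sketch assumes non-negative values already recorded to the leaf unit.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return display lines for a stem-and-leaf plot (hypothetical helper).

    leaf_unit is the place value of each leaf: 1 for ones, 0.1 for tenths.
    """
    # Express every value as a whole number of leaf units, then order them.
    scaled = sorted(round(v / leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        stem, leaf = divmod(s, 10)  # last digit is the leaf, the rest is the stem
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from lowest to highest so empty stems appear as gaps.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


scores = [75, 61, 93, 64, 72, 68, 91, 75]  # made-up illustration data
for line in stem_and_leaf(scores):
    print(line)
```

With these values, stem 8 has no leaves but still gets its own row, which makes the gap between the 70s and the 90s visible.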

Measures of Center

  • Mean: The arithmetic average of all data points.

    • Population Mean: Denoted by $\mu$.

    • Sample Mean: Denoted by $\bar{x} = \frac{\sum x_i}{n}$, where $\sum x_i$ is the sum of all values and $n$ is the sample size.

    • Influence: Highly affected by extreme values (outliers); it is 'dragged' towards them.

  • Median: The middle value of an ordered dataset.

    • Calculation: If an odd number of values, it's the single middle value. If an even number of values, it's the average of the two middle values.

    • Robustness: Resistant to extreme values; an outlier shifts the median only slightly, if at all.

    • Comparison: If mean $\approx$ median, the data is relatively symmetrical. A significant difference suggests skewness or outliers.
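
A small illustration of how one extreme value affects the two measures, using Python's standard statistics module. The data are made up for this sketch.

```python
from statistics import mean, median

symmetric = [2, 3, 4, 5, 6]       # made-up, roughly symmetric data
with_outlier = [2, 3, 4, 5, 60]   # same data, largest value replaced by an outlier

mean_sym, median_sym = mean(symmetric), median(symmetric)
mean_out, median_out = mean(with_outlier), median(with_outlier)

# The outlier pulls the mean upward while the median stays at the middle value.
print(f"symmetric:    mean={mean_sym}, median={median_sym}")
print(f"with outlier: mean={mean_out}, median={median_out}")

# Even number of values: the median is the average of the two middle values.
even_median = median([1, 2, 3, 4])
```

In the first list the mean and median agree; in the second the mean is far above the median, which is the pattern the comparison rule above uses to flag skewness or outliers.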

Five-Number Summary

  • Definition: A set of five key values that summarize the distribution of a dataset.

  • Components:

    • Minimum: The smallest value in the dataset.

    • First Quartile ($Q_1$ or Lower Quartile, $Q_L$): The median of the lower half of the ordered dataset (excluding the overall median for odd-sized data).

    • Median ($M$ or $Q_2$): The middle value of the ordered dataset.

    • Third Quartile ($Q_3$ or Upper Quartile, $Q_U$): The median of the upper half of the ordered dataset (excluding the overall median for odd-sized data).

    • Maximum: The largest value in the dataset.

  • Interpretation: These five numbers divide the data into four quarters, each containing about 25% of the data values.
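
A minimal sketch of the five-number summary using the median-of-halves rule described above (the overall median is left out of both halves when the dataset has an odd number of values). The helper name five_number_summary is hypothetical, and statistical software may use a different quartile convention, so its results can differ slightly.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the middle position
    upper = data[half + n % 2:]   # skips the median itself when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])


odd_summary = five_number_summary([7, 1, 5, 3, 2, 6, 4])      # made-up data, n = 7
even_summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8])  # made-up data, n = 8
```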

Box Plot (Box and Whisker Plot)

  • Definition: A graphical display of the five-number summary and outliers.

  • Display of Five-Number Summary:

    • Bottom Whisker: Represents the minimum value (excluding outliers).

    • Bottom of the Box: Represents the first quartile ($Q_1$).

    • Line inside the Box: Represents the median.

    • Top of the Box: Represents the third quartile ($Q_3$).

    • Top Whisker: Represents the maximum value (excluding outliers).

    • Outliers: Shown as individual points (e.g., circles) outside the whiskers. Specific values can be displayed in software.

  • Mean: Can be optionally displayed as a diamond shape on the box plot for comparison with the median.

  • Group Comparison: Multiple box plots can be generated side-by-side (e.g., by a categorical variable like 'sex') to compare distributions and identify differences in center, spread, and outliers between groups.
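
A hedged sketch of the numbers behind a box plot. The notes do not say how outliers are identified, so this assumes the common 1.5 × IQR fence rule (IQR = $Q_3 - Q_1$); the helper name box_plot_stats, the parameter k, and the data are hypothetical, and the quartiles use the median-of-halves rule from the five-number summary.

```python
from statistics import median


def box_plot_stats(values, k=1.5):
    """Quartiles, whisker ends, and outliers under an assumed k * IQR fence rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])  # skips the median itself when n is odd
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up illustration data
```

Here the value 30 lies beyond the upper fence, so it would be drawn as an individual point and the top whisker would stop at 8. Calling the function once per group gives the numbers needed for a side-by-side comparison.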