Exploring Data with Graphs and Numerical Summaries

Introduction to Statistics using Python

  • A variable is a characteristic recorded for subjects in a study.

  • Variables can be:

    • Categorical: Observations belong to categories (e.g., gender, religion).

    • Quantitative: Observations take numerical values.

      • Discrete: Possible values are separate numbers (e.g., number of pets).

      • Continuous: Possible values form an interval (e.g., height, age).

Describing Quantitative Data

  • Mean: Sum of observations divided by the number of observations.

    • Python: np.mean(X)

  • Median: Midpoint of ordered observations.

    • Odd number of observations: middle value.

    • Even number of observations: average of the two middle values.

    • Python: np.median(X)

  • Mode: Value that occurs most often.

    • Python: st.mode(X)
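
The three measures of center above can be tried on a small made-up sample. This is a minimal sketch: the data are illustrative, and `st` is assumed here to be the standard-library `statistics` module (if the course uses `scipy.stats` instead, `st.mode` returns a result object rather than a plain value).

```python
import numpy as np
import statistics as st  # assumption: "st" refers to the statistics module

X = [2, 3, 3, 5, 7, 10]  # illustrative data, already ordered

mean_x = np.mean(X)      # sum 30 divided by 6 observations
median_x = np.median(X)  # even n: average of the two middle values, 3 and 5
mode_x = st.mode(X)      # 3 occurs twice, every other value once

print(mean_x, median_x, mode_x)
```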

Comparing the Mean and Median

  • Symmetric distribution: Mean and median are close; mean is preferred.

  • Skewed distribution: Median is preferred as it better represents a typical observation.
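
A short sketch of why the median is preferred for skewed data. The incomes below are hypothetical values (in thousands) with one extreme observation added on purpose.

```python
import numpy as np

# Hypothetical incomes in thousands; the last value creates a long right tail.
incomes = np.array([30, 32, 35, 38, 40, 400])

mean_income = np.mean(incomes)      # pulled upward by the extreme value
median_income = np.median(incomes)  # average of the middle values 35 and 38

print(f"mean = {mean_income:.2f}, median = {median_income:.2f}")
```

Five of the six observations lie below the mean, while the median stays near the bulk of the data.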

Describing the Spread of Quantitative Data

  • Range: max - min

    • Advantage: Simple description of the spread of the data.

    • Disadvantage: The range is strongly affected by outliers.

    • Python: np.max(X)-np.min(X)

  • Standard Deviation: Measures variation by summarizing deviations from the mean.

    1. Find mean

    2. Find each deviation

    3. Square deviations

    4. Sum squared deviations

    5. Divide sum by n-1

    6. Take square root

    • Python: np.std(X, ddof=1) (the default, np.std(X), divides by n instead of n-1)
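
The six steps for the standard deviation can be sketched directly and compared with NumPy. The data are illustrative. Note that `np.std` divides by n unless `ddof=1` is passed.

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

data_range = np.max(X) - np.min(X)      # range: max - min

mean = X.mean()                          # 1. find the mean
deviations = X - mean                    # 2. find each deviation
squared = deviations ** 2                # 3. square the deviations
total = squared.sum()                    # 4. sum the squared deviations
variance = total / (len(X) - 1)          # 5. divide the sum by n-1
s = np.sqrt(variance)                    # 6. take the square root

print(data_range, s, np.std(X, ddof=1))  # the last two values agree
```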

Measures of Position: Percentiles and Quartiles

  • The k-th percentile, P_k, is a value such that k percent of the observations are less than or equal to it.

  • Quartiles:

    • Q1: 25th percentile.

    • Q2: 50th percentile (median).

    • Q3: 75th percentile.

  • Finding Quartiles

    1. Arrange data in order

    2. The median is the second quartile, Q2

    3. Q1 is the median of the lower half of the observations

    4. Q3 is the median of the upper half of the observations
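
A minimal sketch of the quartile procedure above. The function name `quartiles` is hypothetical, and one convention is assumed: when the number of observations is odd, the median itself is left out of both halves. `np.percentile` interpolates between observations by default, so it can return slightly different values for Q1 and Q3.

```python
import numpy as np

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    x = np.sort(np.asarray(values, dtype=float))  # 1. arrange data in order
    n = len(x)
    half = n // 2
    lower = x[:half]
    # Assumption: for odd n the median is excluded from both halves.
    upper = x[half:] if n % 2 == 0 else x[half + 1:]
    return np.median(lower), np.median(x), np.median(upper)

q1, q2, q3 = quartiles([1, 3, 4, 6, 8, 9, 12])
print(q1, q2, q3)
```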

Five-Number Summary

  1. Minimum value

  2. First Quartile

  3. Median

  4. Third Quartile

  5. Maximum value
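
One possible way to collect the five numbers in a single call is `np.percentile`. This is a sketch with a hypothetical helper name. Because `np.percentile` interpolates by default, its quartiles may differ slightly from hand calculations based on the medians of the two halves.

```python
import numpy as np

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max as a dictionary."""
    x = np.asarray(values, dtype=float)
    minimum, q1, median, q3, maximum = np.percentile(x, [0, 25, 50, 75, 100])
    return {"min": minimum, "Q1": q1, "median": median, "Q3": q3, "max": maximum}

five = five_number_summary([1, 3, 4, 6, 8, 9, 12])  # illustrative data
print(five)
```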

Describing Categorical Variables

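  • Proportion and Percentage (Relative Frequency): The proportion is the number of observations in a category divided by the total number of observations; the percentage is the proportion multiplied by 100.

  • Frequency Table: A listing of the possible values of a variable, together with the number of observations (frequency) or the relative frequency for each value.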

    • Python: Use pandas to calculate absolute and relative frequencies.
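
A small sketch of a frequency table built with pandas. The pet data and the column names are made up for illustration.

```python
import pandas as pd

pets = pd.Series(["dog", "cat", "dog", "fish", "dog", "cat"], name="pet")

counts = pets.value_counts()                     # absolute frequencies
proportions = pets.value_counts(normalize=True)  # relative frequencies

# Combine both into one frequency table, aligned on the category labels.
table = pd.DataFrame({
    "count": counts,
    "proportion": proportions,
    "percent": proportions * 100,
})
print(table)
```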

Graphical Summaries

  • Pie Charts:

    • Represent categorical data as slices of a circle.

  • Bar Graphs:

    • Vertical bars represent counts or percentages for each category.

    • Pareto Charts: Bar graphs with the categories ordered from tallest bar to shortest.

      • Python: Use seaborn.
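
A sketch of a Pareto-style bar graph. It uses matplotlib directly rather than seaborn to keep the example small, and the survey responses are invented. `value_counts()` already returns the counts sorted from largest to smallest, which gives the Pareto ordering.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented responses to "How do you get to campus?"
responses = pd.Series(["bus", "car", "car", "bike", "car", "bus", "walk", "car"])

counts = responses.value_counts()  # sorted in descending order by default

fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.values)  # one vertical bar per category
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Pareto chart: categories ordered by count")
# plt.show()  # uncomment in an interactive session
```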

  • Histograms:

    • Bars show frequencies or relative frequencies for quantitative variables.

    • Python: Use seaborn.

  • Interpreting Histograms

    • Center: Assess where a distribution is centered, for example by finding the median.

    • Spread: Assess how widely the observations vary around the center.

    • Shape: roughly symmetric, skewed to the right, or skewed to the left
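
The numbers behind a histogram can be inspected without drawing it. The sketch below uses `np.histogram` on illustrative data with a long right tail; the choice of four bins on [0, 12] is arbitrary.

```python
import numpy as np

X = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 12])  # illustrative data

# Bins: [0, 3), [3, 6), [6, 9), [9, 12] (the last bin includes its right edge)
counts, edges = np.histogram(X, bins=4, range=(0, 12))
relative = counts / counts.sum()  # relative frequencies

center = np.median(X)             # center
spread = X.max() - X.min()        # spread, here measured by the range
right_skewed = X.mean() > center  # a mean above the median hints at right skew

print(counts, relative.round(2), center, spread, right_skewed)
```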

  • Boxplots:

    • Display the distribution of data based on the five-number summary

      1. Box goes from Q1 to Q3

      2. Line is drawn inside the box at the median

      3. Line goes from lower end of box (Q1) to smallest observation not a potential outlier

      4. Line goes from upper end of box (Q3) to largest observation not a potential outlier

      5. Potential outliers are shown separately, often with * or +

      • Python: Use seaborn.
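
The quantities a boxplot draws can be computed by hand. The notes do not say how a potential outlier is identified; the sketch below assumes the common 1.5 × IQR rule, and the helper name `boxplot_stats` is hypothetical.

```python
import numpy as np

def boxplot_stats(values):
    """Compute box, whisker ends and potential outliers (1.5 x IQR rule assumed)."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "Q1": q1, "median": median, "Q3": q3,
        "whisker_low": inside.min(),    # smallest observation not a potential outlier
        "whisker_high": inside.max(),   # largest observation not a potential outlier
        "outliers": x[(x < lower_fence) | (x > upper_fence)],
    }

box = boxplot_stats([1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 12])  # illustrative data
print(box)
```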

Comparing Distributions

  • Boxplots are useful for comparing datasets.

Data Preprocessing

  • Data Normalization: Rescaling variables to a common scale so that they are comparable.

    • Z-Scores: Measure how many standard deviations an observation is from the mean; the rescaled variable has mean 0 and standard deviation 1.

      • Python: preprocessing.StandardScaler().

    • Min-Max Scaling: Scales data to a [0, 1] range.

      • Python: preprocessing.MinMaxScaler().
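
Both rescalings can be written out in plain NumPy to show the arithmetic. This is a sketch of the formulas rather than of the scikit-learn classes named above; the data are illustrative. The z-score here uses the population standard deviation (dividing by n), which is also what `StandardScaler` uses.

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # illustrative data

# Z-score: subtract the mean, divide by the standard deviation (ddof=0).
z = (X - X.mean()) / X.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1.
minmax = (X - X.min()) / (X.max() - X.min())

print(z.round(3))
print(minmax)
```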

  • Discretization: Converting continuous variables into discrete values (binning).

    • Example: Converting GPA to bins (0-1, 1-2, 2-3, 3-4).

    • Python: preprocessing.KBinsDiscretizer().
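
The GPA example can be sketched with `pandas.cut`, used here instead of `KBinsDiscretizer` because it accepts explicit bin edges. The GPA values are made up, and one convention is assumed: each bin includes its right edge, so a GPA of exactly 2.0 falls in the 1-2 bin.

```python
import pandas as pd

gpa = pd.Series([0.8, 1.5, 2.0, 2.7, 3.3, 4.0])  # made-up GPA values

bins = [0, 1, 2, 3, 4]
labels = ["0-1", "1-2", "2-3", "3-4"]

# Right-closed bins (0,1], (1,2], ...; include_lowest also admits a GPA of 0.
gpa_bin = pd.cut(gpa, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"gpa": gpa, "bin": gpa_bin}))
```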

  • Encoding Categorical Features: Converting categorical data into numerical form.

    • Ordinal Encoding: Assigning integers based on order.

      • Python: preprocessing.OrdinalEncoder().

    • Label Encoding: Similar to ordinal encoding, but applied to a single 1-D array of labels (typically the target variable).

      • Python: preprocessing.LabelEncoder().

    • One-Hot Encoding: Creating binary columns for each category.
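
Ordinal and one-hot encoding can be illustrated with pandas alone. This is a sketch of the two ideas, not of the scikit-learn encoders named above; the categories and their order are invented for the example.

```python
import pandas as pd

# Ordinal encoding: integers follow an explicitly stated order.
sizes = pd.Series(["small", "large", "medium", "small"])
order = ["small", "medium", "large"]  # assumed ordering of the categories
ordinal = pd.Categorical(sizes, categories=order, ordered=True).codes

# One-hot encoding: one binary column per category.
colors = pd.Series(["red", "green", "red", "blue"], name="color")
one_hot = pd.get_dummies(colors, dtype=int)

print(ordinal)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of that row's category.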

Case Study: Iris Dataset

  • Analyzing the Iris dataset with sepal and petal measurements for three species.

  • Steps:

    1. Read the data using pandas.

    2. Calculate numerical summaries (value_counts(), describe()).

    3. Create visual summaries (histograms, box plots, scatter plots, pair plots, heatmaps) using matplotlib and seaborn.

    4. Filter a DataFrame, e.g., data[data.Class == "Iris-setosa"].
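
The numerical steps can be sketched on a few hand-typed rows. In practice the full file would be loaded with `pd.read_csv`; the file location is not given in these notes, so a small in-memory sample is used instead. All column names except `Class` are assumptions.

```python
import pandas as pd

# Six rows typed in by hand; the real dataset has 150 rows.
# Column names other than "Class" are assumed for this sketch.
data = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidth":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "PetalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidth":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

class_counts = data["Class"].value_counts()  # observations per species
summary = data.describe()                    # count, mean, std, quartiles per numeric column
setosa = data[data.Class == "Iris-setosa"]   # keep only the setosa rows

print(class_counts)
print(summary)
print(setosa)
```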

Learning Outcomes

  • Explain the importance of AI and data science for society

  • Perform data loading, preprocessing, summarization and visualization

  • Apply machine learning methods to solve basic regression and classification problems

  • Apply artificial neural networks to solve simple engineering problems

  • Implement basic data science and machine learning tasks using programming tools