Exploring Data with Graphs and Numerical Summaries
Introduction to Statistics using Python
A variable is a characteristic recorded for subjects in a study.
Variables can be:
Categorical: Observations belong to categories (e.g., gender, religion).
Quantitative: Observations take numerical values.
Discrete: Possible values are separate numbers (e.g., number of pets).
Continuous: Possible values form an interval (e.g., height, age).
Describing Quantitative Data
Mean: Sum of observations divided by the number of observations.
Python:
np.mean(X)
Median: Midpoint of ordered observations.
Odd number of observations: middle value.
Even number of observations: average of the two middle values.
Python:
np.median(X)
Mode: Value that occurs most often.
Python:
st.mode(X)
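The three measures of center above can be tried on a small array. This is a minimal sketch that assumes `np` is NumPy and `st` is `scipy.stats` (the notes do not say which module `st` refers to); the data values are made up for illustration.

```python
import numpy as np
from scipy import stats as st  # assumption: `st` in the notes means scipy.stats

# Made-up sample with an even number of observations
X = np.array([2, 3, 3, 5, 7, 10])

mean = np.mean(X)      # sum (30) divided by the number of observations (6)
median = np.median(X)  # even count: average of the two middle values, 3 and 5

# st.mode returns an array in older SciPy releases and a scalar in newer ones;
# np.atleast_1d makes the lookup work in both cases.
mode = np.atleast_1d(st.mode(X).mode)[0]

print(mean, median, mode)
```

When several values tie for the highest count, `scipy.stats.mode` reports only the smallest of them.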
Comparing the Mean and Median
Symmetric distribution: Mean and median are close; mean is preferred.
Skewed distribution: Median is preferred as it better represents a typical observation.
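A short sketch of why the median is preferred for skewed data: adding one extreme value pulls the mean far away while the median barely moves. The numbers are invented for illustration.

```python
import numpy as np

# Made-up, roughly symmetric sample
values = np.array([30, 32, 35, 38, 40])
mean_sym = np.mean(values)
median_sym = np.median(values)

# Same sample with one extreme observation added (right-skewed)
skewed = np.append(values, 500)
mean_skew = np.mean(skewed)
median_skew = np.median(skewed)

print("symmetric:", mean_sym, median_sym)
print("skewed:   ", mean_skew, median_skew)
```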
Describing the Spread of Quantitative Data
Range: max - min
Advantage: A simple description of the spread of the data.
Disadvantage: The range is strongly affected by outliers.
Python:
np.max(X)-np.min(X)
Standard Deviation: Measures variation by summarizing deviations from the mean.
Find mean
Find each deviation
Square deviations
Sum squared deviations
Divide sum by n-1
Take square root
Python:
np.std(X, ddof=1)
Note: np.std(X) without ddof=1 divides by n instead of n-1.
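The six steps for the standard deviation can be followed literally in NumPy and compared with the built-in call. This is a sketch on made-up data; note that NumPy divides by n unless `ddof=1` is passed.

```python
import numpy as np

X = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up sample

data_range = np.max(X) - np.min(X)     # range: max - min

mean = np.mean(X)                      # 1. find the mean
deviations = X - mean                  # 2. find each deviation
squared = deviations ** 2              # 3. square the deviations
total = np.sum(squared)                # 4. sum the squared deviations
variance = total / (len(X) - 1)        # 5. divide the sum by n-1
std_manual = np.sqrt(variance)         # 6. take the square root

std_sample = np.std(X, ddof=1)         # same result: divides by n-1
std_population = np.std(X)             # default ddof=0: divides by n

print(data_range, std_manual, std_sample, std_population)
```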
Measures of Position: Percentiles and Quartiles
The k-th percentile, P_k, is a value such that k percent of the observations are less than or equal to it.
Quartiles:
Q1: 25th percentile.
Q2: 50th percentile (median).
Q3: 75th percentile.
Finding Quartiles
Arrange data in order
The median is the second quartile, Q2
Q1 is the median of the lower half of the observations
Q3 is the median of the upper half of the observations
Five-Number Summary
Minimum value
First Quartile
Median
Third Quartile
Maximum value
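The quartile procedure and the five-number summary can be sketched as one small function. `five_number_summary` is a hypothetical helper, not a library function, and it assumes that for an odd number of observations the median itself is left out of both halves (the notes do not say which convention is intended).

```python
import numpy as np

def five_number_summary(x):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    x = np.sort(np.asarray(x, dtype=float))   # arrange data in order
    n = len(x)
    half = n // 2
    lower = x[:half]                          # lower half of the observations
    # Assumption: with an odd count, the middle value is excluded from both halves
    upper = x[half:] if n % 2 == 0 else x[half + 1:]
    return (x[0], np.median(lower), np.median(x), np.median(upper), x[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # made-up sample
summary = five_number_summary(data)
print(summary)
```

`np.percentile(data, [25, 50, 75])` interpolates between observations by default, so its quartiles can differ slightly from the median-of-halves values on small samples.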
Describing Categorical Variables
Proportion & Percentage (Relative Frequency): The proportion is the number of observations in a category divided by the total number of observations; the percentage is the proportion multiplied by 100.
Frequency Table: Listing of possible values for a variable, together with the number of observations or relative frequencies for each value.
Python: Use pandas to calculate absolute and relative frequencies.
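A frequency table with absolute and relative frequencies can be built with `value_counts()`. This is a minimal sketch on an invented categorical variable; the column names of the resulting table are chosen here for illustration.

```python
import pandas as pd

# Made-up categorical observations
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

counts = colors.value_counts()                      # absolute frequencies
proportions = colors.value_counts(normalize=True)   # relative frequencies

# Combine into one frequency table (column names are illustrative)
table = pd.DataFrame({
    "count": counts,
    "proportion": proportions,
    "percent": 100 * proportions,
})
print(table)
```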
Graphical Summaries
Pie Charts:
Represent categorical data as slices of a circle.
Bar Graphs:
Vertical bars represent counts or percentages for each category.
Pareto Charts: Bar graphs with categories ordered from tallest to shortest bar.
Python: Use seaborn.
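One way to draw a bar graph in Pareto order with seaborn is to pass the categories sorted by count to `countplot`. This is a sketch on invented data; the `Agg` backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up categorical data
df = pd.DataFrame({"pet": ["dog", "cat", "dog", "fish", "dog", "cat"]})

# value_counts() sorts by descending count, which gives the Pareto order
order = df["pet"].value_counts().index.tolist()

ax = sns.countplot(data=df, x="pet", order=order)
ax.set_ylabel("count")
plt.close("all")
```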
Histograms:
Bars show frequencies or relative frequencies for quantitative variables.
Python: Use seaborn.
Interpreting Histograms
Center: Assess where a distribution is centered by finding the median.
Spread: Assess how far the observations extend around the center.
Shape: Roughly symmetric, skewed to the right, or skewed to the left.
Boxplots:
Display the distribution of data based on the five-number summary
Box goes from Q1 to Q3
Line is drawn inside the box at the median
Line goes from lower end of box (Q1) to smallest observation not a potential outlier
Line goes from upper end of box (Q3) to largest observation not a potential outlier
Potential outliers are shown separately, often with * or +
Python: Use seaborn.
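The notes do not state how potential outliers are identified. A common convention, and the default in matplotlib and seaborn boxplots, is the 1.5 x IQR rule; the sketch below assumes that rule and uses made-up data. Quartiles here come from `np.percentile`, which interpolates between observations.

```python
import numpy as np

x = np.array([1, 3, 4, 6, 7, 8, 10, 12, 30])  # made-up sample

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Assumption: 1.5 x IQR criterion for potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = x[(x >= lower_fence) & (x <= upper_fence)]
outliers = x[(x < lower_fence) | (x > upper_fence)]

# Whiskers end at the most extreme observations that are not potential outliers
whisker_low, whisker_high = inside.min(), inside.max()
print(q1, q3, whisker_low, whisker_high, outliers)
```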
Comparing Distributions
Boxplots are useful for comparing datasets.
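Side-by-side boxplots make it easy to compare the center and spread of several groups. A minimal sketch with seaborn, using two invented groups; the column names `group` and `score` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up scores for two groups
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "score": [1, 2, 3, 4, 5, 3, 4, 5, 6, 20],
})

# One box per group on a shared axis
ax = sns.boxplot(data=df, x="group", y="score")

# The medians drawn inside the boxes can be checked numerically
medians = df.groupby("group")["score"].median()
plt.close("all")
```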
Data Preprocessing
Data Normalization: Scaling variables to the same range (e.g., [0, 1]).
Z-Scores: Measures how many standard deviations an element is from the mean.
Python:
preprocessing.StandardScaler().
Min-Max Scaling: Scales data to a [0, 1] range.
Python:
preprocessing.MinMaxScaler().
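Both scalers can be compared on the same column. This is a sketch on made-up values; scikit-learn expects a 2-D array with one column per variable. Note that `StandardScaler` computes the standard deviation with a divisor of n, not n-1.

```python
import numpy as np
from sklearn import preprocessing

# One variable, five made-up observations, shaped as a single column
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Z-scores: subtract the mean, divide by the standard deviation (divisor n)
z = preprocessing.StandardScaler().fit_transform(X)

# Min-max scaling: (x - min) / (max - min), mapped onto [0, 1]
mm = preprocessing.MinMaxScaler().fit_transform(X)

print(z.ravel())
print(mm.ravel())
```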
Discretization: Converting continuous variables into discrete values (binning).
Example: Converting GPA to bins (0-1, 1-2, 2-3, 3-4).
Python:
preprocessing.KBinsDiscretizer().
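The GPA example can be sketched with `KBinsDiscretizer` using four equal-width bins. With `strategy="uniform"` the bin edges are derived from the minimum and maximum of the data, so the made-up sample below deliberately includes 0.0 and 4.0 to obtain the edges 0, 1, 2, 3, 4.

```python
import numpy as np
from sklearn import preprocessing

# Made-up GPA values spanning the full 0-4 range
gpa = np.array([[0.0], [0.7], [1.5], [2.2], [2.9], [3.6], [4.0]])

# encode="ordinal" returns the bin index (0..3) instead of one-hot columns
binner = preprocessing.KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
bins = binner.fit_transform(gpa)

edges = binner.bin_edges_[0]  # edges learned from the data
print(edges)
print(bins.ravel())
```

The last bin includes its right edge, so a GPA of exactly 4.0 falls into bin 3.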
Encoding Categorical Features: Converting categorical data into numerical form.
Ordinal Encoding: Assigning integers based on order.
Python:
preprocessing.OrdinalEncoder().
Label Encoding: Similar to ordinal encoding, but applied to a single 1-D array, typically the target labels.
Python:
preprocessing.LabelEncoder()
One-Hot Encoding: Creating binary columns for each category.
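The three encoders can be compared on a tiny example. This is a sketch with invented categories; the explicit order passed to `OrdinalEncoder` is an assumption about what "small < medium < large" should mean, since by default the encoder sorts categories alphabetically.

```python
import numpy as np
from sklearn import preprocessing

# Made-up feature column (2-D: one column, four rows)
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Ordinal encoding with an explicit category order
ordinal = preprocessing.OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = ordinal.fit_transform(sizes)

# One-hot encoding: one binary column per category (alphabetical column order)
onehot = preprocessing.OneHotEncoder()
dense = onehot.fit_transform(sizes).toarray()

# Label encoding works on a 1-D array, typically the target variable
labels = preprocessing.LabelEncoder().fit_transform(["cat", "dog", "cat"])

print(codes.ravel())
print(onehot.categories_[0], dense, sep="\n")
print(labels)
```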
Case Study: Iris Dataset
Analyzing the Iris dataset with sepal and petal measurements for three species.
Steps:
Read the data using pandas.
Calculate numerical summaries (value_counts(), describe()).
Create visual summaries (histograms, box plots, scatter plots, pair plots, heatmaps) using matplotlib and seaborn.
Filter a DataFrame:
data[(data.Class == "Iris-setosa")]
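The summary and filtering steps can be sketched on a small stand-in table. The notes do not give the file path or column names, so the DataFrame below is typed in by hand with a few Iris-style rows; `SepalLength` and `PetalLength` are hypothetical column names, while `Class` and the `"Iris-setosa"` label follow the filter expression above. In practice the table would come from something like `pd.read_csv(...)` on the course data file.

```python
import pandas as pd

# Hand-typed stand-in for the Iris table (column names partly hypothetical)
data = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "PetalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

class_counts = data["Class"].value_counts()  # frequency of each species
summary = data.describe()                    # count, mean, std, quartiles, min, max

# Boolean mask keeps only the rows whose Class equals "Iris-setosa"
setosa = data[(data.Class == "Iris-setosa")]

print(class_counts)
print(summary)
print(setosa)
```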
Learning Outcomes
Importance of AI and Data Science for society
Perform data loading, preprocessing, summarization and visualization
Apply machine learning methods to solve basic regression and classification problems
Apply artificial neural networks to solve simple engineering problems
Implement basic data science and machine learning tasks using programming tools.