Exploring Data with Graphs and Numerical Summaries

A variable is a characteristic recorded for subjects in a study.
Variables can be:
- Categorical: Observations belong to categories (e.g., gender, religion).
- Quantitative: Observations take numerical values.
  - Discrete: Possible values are separate numbers (e.g., number of pets).
  - Continuous: Possible values form an interval (e.g., height, age).

Mean: Sum of observations divided by the number of observations.
- Python: np.mean(X)
Median: Midpoint of ordered observations.
- Odd number of observations: middle value.
- Even number of observations: average of the two middle values.
- Python: np.median(X)
Mode: Value that occurs most often.
- Python: st.mode(X) (st being an alias for a statistics module such as scipy.stats)

Symmetric distribution: Mean and median are close; mean is preferred.
Skewed distribution: Median is preferred as it better represents a typical observation.
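The measures of center above can be tried on a small made-up sample. This is a minimal sketch: the data are invented for illustration, and the alias for the mode is assumed to be scipy.stats (the notes do not say which module st refers to). Appending one extreme value shows why the median is preferred for skewed data.

```python
import numpy as np
from scipy import stats  # assumption: the notes' "st" refers to scipy.stats

# Invented sample, already roughly symmetric around 4-5
X = np.array([2, 3, 3, 5, 7, 10])

mean = np.mean(X)       # 30 / 6 = 5.0
median = np.median(X)   # even n: average of 3 and 5 = 4.0
# np.atleast_1d handles both older (array) and newer (scalar) SciPy return shapes
mode = np.atleast_1d(stats.mode(X).mode)[0]   # 3 occurs twice

# Add a single large value to skew the data to the right
X_skewed = np.append(X, 100)
mean_skewed = np.mean(X_skewed)      # pulled far upward by the outlier
median_skewed = np.median(X_skewed)  # middle of 7 ordered values = 5.0

print(mean, median, mode)
print(mean_skewed, median_skewed)
```

The mean moves from 5 to above 18 while the median only moves from 4 to 5, which matches the rule of thumb above.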

Range: max - min
- Advantage: Simple description of the spread of the data.
- Disadvantage: The range is strongly affected by outliers.
- Python: np.max(X)-np.min(X)
Standard Deviation: Measures variation by summarizing deviations from the mean.
1. Find mean
2. Find each deviation
3. Square deviations
4. Sum squared deviations
5. Divide sum by n-1
6. Take square root
- Python: np.std(X, ddof=1) (the default, ddof=0, divides by n instead of n-1)
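The six steps can be followed literally in NumPy and compared with np.std. The sample below is invented for illustration. Note that np.std divides by n unless ddof=1 is passed, so only the ddof=1 call reproduces the n-1 recipe.

```python
import numpy as np

# Invented sample for illustration
X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(X)

mean = X.sum() / n              # step 1: mean = 40 / 8 = 5
deviations = X - mean           # step 2: each deviation from the mean
squared = deviations ** 2       # step 3: squared deviations
total = squared.sum()           # step 4: sum of squared deviations = 32
variance = total / (n - 1)      # step 5: divide by n - 1
s = np.sqrt(variance)           # step 6: square root

print(s)                        # sample standard deviation
print(np.std(X, ddof=1))        # same value: ddof=1 divides by n - 1
print(np.std(X))                # different: default ddof=0 divides by n
```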

Percentile: The kth percentile, $P_k$, is a value such that k percent of the observations are less than or equal to it.
Quartiles:
- Q1: 25th percentile.
- Q2: 50th percentile (median).
- Q3: 75th percentile.
Finding Quartiles
1. Arrange data in order
2. The median is the second quartile, Q2
3. Q1 is the median of the lower half of the observations
4. Q3 is the median of the upper half of the observations
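The four steps can be sketched as a small function. The name quartiles is hypothetical, and one convention is assumed here: for an odd number of observations the middle value is left out of both halves. Other conventions exist, and np.percentile interpolates between observations, so its results can differ slightly from this method.

```python
import numpy as np

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: with an odd number of observations, the middle
    value is excluded from both the lower and the upper half.
    """
    x = np.sort(np.asarray(values, dtype=float))  # step 1: order the data
    n = len(x)
    half = n // 2
    lower = x[:half]
    upper = x[half:] if n % 2 == 0 else x[half + 1:]
    q2 = np.median(x)        # step 2: the median is Q2
    q1 = np.median(lower)    # step 3: median of the lower half
    q3 = np.median(upper)    # step 4: median of the upper half
    return q1, q2, q3

print(quartiles([13, 1, 7, 3, 11, 5, 9]))     # odd number of observations
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))    # even number of observations
```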

Proportion & Percentage (Relative Frequency): The proportion of a category is its frequency divided by the total number of observations; the percentage is the proportion multiplied by 100.
Frequency Table: Listing of possible values for a variable, together with the number of observations (frequency) or the relative frequency for each value.
- Python: Use pandas to calculate absolute and relative frequencies.
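One way to build such a table with pandas is value_counts(), which returns absolute frequencies by default and proportions with normalize=True. The pet data and the column names below are invented for illustration.

```python
import pandas as pd

# Invented categorical observations
pets = pd.Series(["dog", "cat", "dog", "fish", "dog", "cat"], name="pet")

counts = pets.value_counts()                      # absolute frequencies
proportions = pets.value_counts(normalize=True)   # relative frequencies

# Columns are aligned on the category labels in the index
table = pd.DataFrame({
    "frequency": counts,
    "proportion": proportions,
    "percentage": proportions * 100,
})
print(table)
```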

Pie Charts:
- Represent categorical data as slices of a circle.
Bar Graphs:
- Vertical bars represent counts or percentages for each category.
- Pareto Charts: Bar graphs with the categories ordered from tallest to shortest bar.
  - Python: Use seaborn.
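A Pareto-style bar graph can be sketched by counting the categories, sorting the counts in descending order, and passing that order to seaborn. The color data are invented, and the Agg backend is set only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented categorical observations
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts() already sorts from most to least frequent
counts = colors.value_counts()
categories = list(counts.index)

fig, ax = plt.subplots()
# order= fixes the bar order explicitly: tallest to shortest
sns.barplot(x=categories, y=list(counts.values), order=categories, ax=ax)
ax.set_xlabel("color")
ax.set_ylabel("count")
fig.savefig("pareto.png")
```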
Histograms:
- Bars show frequencies or relative frequencies for quantitative variables.
- Python: Use seaborn.
Interpreting Histograms
- Center: Assess where a distribution is centered by finding the median
- Spread: Assess the spread of a distribution
- Shape: roughly symmetric, skewed to the right, or skewed to the left
Boxplots:
- Display the distribution of data based on the five-number summary (minimum, Q1, median, Q3, maximum)
  1. Box goes from Q1 to Q3
  2. Line is drawn inside the box at the median
  3. Line goes from lower end of box (Q1) to smallest observation not a potential outlier
  4. Line goes from upper end of box (Q3) to largest observation not a potential outlier
  5. Potential outliers are shown separately, often with * or +
  - Python: Use seaborn.
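The numbers behind a boxplot can be computed directly. The notes do not say how potential outliers are identified; the sketch below assumes the common 1.5 x IQR rule, which is also seaborn's default whisker setting. The data are invented, and quartiles come from np.percentile, which interpolates.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Invented sample with one suspiciously large value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumption: potential outliers lie more than 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the most extreme observations that are not potential outliers
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

print(data.min(), q1, med, q3, data.max())   # five-number summary
print(whisker_low, whisker_high, outliers)

fig, ax = plt.subplots()
sns.boxplot(x=data, ax=ax)
fig.savefig("boxplot.png")
```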

Data Normalization: Rescaling variables to a common scale so that they are comparable (e.g., to the range [0, 1], or to mean 0 and standard deviation 1).
- Z-Scores: Measures how many standard deviations an element is from the mean.
  - Python: preprocessing.StandardScaler().
- Min-Max Scaling: Scales data to a [0, 1] range.
  - Python: preprocessing.MinMaxScaler().
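Both scalers can be compared on one invented column. scikit-learn expects a 2-D array (rows are observations, columns are variables), and StandardScaler divides by the standard deviation computed with n rather than n-1.

```python
import numpy as np
from sklearn import preprocessing

# Invented single-column data; shape (5, 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Z-scores: (x - mean) / standard deviation
z = preprocessing.StandardScaler().fit_transform(X)

# Min-max scaling: (x - min) / (max - min)
mm = preprocessing.MinMaxScaler().fit_transform(X)

print(z.ravel())    # centered on 0, unit standard deviation
print(mm.ravel())   # between 0 and 1
```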
Discretization: Converting continuous variables into discrete values (binning).
- Example: Converting GPA to bins (0-1, 1-2, 2-3, 3-4).
- Python: preprocessing.KBinsDiscretizer().
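The GPA example might look like the sketch below. The GPA values are invented and chosen to span 0 to 4, because with strategy="uniform" the bin edges are derived from the observed minimum and maximum rather than set by hand.

```python
import numpy as np
from sklearn import preprocessing

# Invented GPAs spanning the full 0-4 scale; shape (6, 1)
gpa = np.array([[0.0], [0.5], [1.5], [2.7], [3.2], [4.0]])

# Four equal-width bins, encoded as the integers 0..3
binner = preprocessing.KBinsDiscretizer(
    n_bins=4, encode="ordinal", strategy="uniform"
)
bins = binner.fit_transform(gpa)

print(binner.bin_edges_[0])   # edges learned from the data
print(bins.ravel())           # bin index for each GPA
```

If the edges must be exactly 0, 1, 2, 3, 4 regardless of the data, pandas.cut with explicit bins is an alternative.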
Encoding Categorical Features: Converting categorical data into numerical form.
- Ordinal Encoding: Assigning integers based on order.
  - Python: preprocessing.OrdinalEncoder().
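- Label Encoding: Similar to ordinal encoding, but applied to a single 1-D array, typically the target labels.
  - Python: preprocessing.LabelEncoder().
- One-Hot Encoding: Creating one binary (0/1) column for each category.
  - Python: preprocessing.OneHotEncoder().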
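The three encoders can be compared on small invented inputs. By default scikit-learn orders categories alphabetically, so the sketch passes the intended order explicitly to OrdinalEncoder. The one-hot result is converted with toarray() because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn import preprocessing

# Invented ordered feature; 2-D because it is a feature column
sizes = np.array([["low"], ["high"], ["medium"], ["low"]])

# Ordinal encoding with an explicit order: low=0, medium=1, high=2
oe = preprocessing.OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal = oe.fit_transform(sizes)

# Label encoding works on a 1-D array; classes are sorted alphabetically
le = preprocessing.LabelEncoder()
labels = le.fit_transform(["cat", "dog", "cat", "bird"])

# One-hot encoding: one 0/1 column per category (alphabetical column order)
ohe = preprocessing.OneHotEncoder()
onehot = ohe.fit_transform(sizes).toarray()

print(ordinal.ravel())
print(labels, list(le.classes_))
print(onehot)
```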

Example: Analyzing the Iris dataset, which contains sepal and petal measurements for three species.
Steps:
1. Read the data using pandas.
2. Calculate numerical summaries (value_counts(), describe()).
3. Create visual summaries (histograms, box plots, scatter plots, pair plots, heatmaps) using matplotlib and seaborn.
4. Filter a DataFrame, e.g., data[data.Class == "Iris-setosa"].
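The numerical steps can be sketched as follows. The notes read the data from a file with pandas, but the file is not given, so this sketch assumes the copy bundled with scikit-learn and rebuilds a Class column with the "Iris-..." labels used in the filter above. The plotting step is left out; the correlation matrix is the kind of table a heatmap would display.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of a CSV file
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
# Recreate a "Class" column with labels such as "Iris-setosa"
data["Class"] = ["Iris-" + iris.target_names[t] for t in iris.target]

# Step 2: numerical summaries
class_counts = data["Class"].value_counts()   # observations per species
summary = data.describe()                     # count, mean, std, quartiles, ...

# Input for a heatmap: correlations between the four measurements
corr = data.drop(columns="Class").corr()

# Step 4: filter the rows of one species
setosa = data[data.Class == "Iris-setosa"]

print(class_counts)
print(summary)
print(setosa.shape)
```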

Importance of AI and Data Science for society
Perform data loading, preprocessing, summarization and visualization
Apply machine learning methods to solve basic regression and classification problems
Apply artificial neural networks to solve simple engineering problems
Implement basic data science and machine learning tasks using programming tools