Data Science - Chapter 1: Use of Statistics in Data Science

A subset is a smaller set of data taken from a larger dataset for analysis.
Subsetting is an indexing operation used to access elements of an object, select variables, and filter observations.
It lets you focus on the data you need by filtering out unnecessary content.
Example: Selecting the first 5 rows and 5 columns from a table of 100 rows and 100 columns to create a subset.
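
As a minimal sketch of the example above, assuming the table is held as a plain Python list of rows (the names `table` and `subset` and the cell values are illustrative, not from the source), slicing picks out the first 5 rows and the first 5 columns:

```python
# Hypothetical 100 x 100 table: cell (r, c) holds r * 100 + c.
table = [[r * 100 + c for c in range(100)] for r in range(100)]

# Keep the first 5 rows, and within each of those rows the first 5 columns.
subset = [row[:5] for row in table[:5]]

print(len(subset), len(subset[0]))  # 5 5
```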

Row-based subsetting: Taking some rows from the top or bottom of the table. Example: Selecting the top 3 rows from a table of 6 rows and 4 columns.
Column-based subsetting: Selecting specific columns from the dataset. This is useful when the original dataset contains a large number of columns, and not all are necessary for analysis.
Data-based subsetting: Subsetting the data based on specific data values or conditions.
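
The three kinds of subsetting can be sketched on a small table. The data below (6 rows and 4 columns, stored as a list of dictionaries) and the score threshold of 80 are made up for illustration only:

```python
# Hypothetical table of 6 rows and 4 columns (id, name, age, score).
rows = [
    {"id": 1, "name": "Asha", "age": 12, "score": 85},
    {"id": 2, "name": "Ben", "age": 15, "score": 75},
    {"id": 3, "name": "Chen", "age": 11, "score": 80},
    {"id": 4, "name": "Dina", "age": 14, "score": 92},
    {"id": 5, "name": "Eli", "age": 13, "score": 67},
    {"id": 6, "name": "Fay", "age": 16, "score": 88},
]

# Row-based subsetting: take the top 3 rows.
top_rows = rows[:3]

# Column-based subsetting: keep only the columns needed for analysis.
wanted_columns = ["name", "score"]
name_score = [{col: row[col] for col in wanted_columns} for row in rows]

# Data-based subsetting: keep rows whose values satisfy a condition.
high_scorers = [row for row in rows if row["score"] >= 80]

print([row["name"] for row in high_scorers])
```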

A two-way table is a statistical table that displays the observed number or frequency for two variables.
Rows indicate one category, and columns indicate the other category, showing how many data points fit in each category.
Example: A table showing the relationship between age groups ("5-10 years", "10-15 years", "15-20 years") and their preference for chocolates ("Like chocolates" or "Do not like chocolates").
Each cell in the table represents the number (or frequency) of people in that specific category.
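
A two-way table like this can be built by counting pairs of categories. The sketch below uses `collections.Counter` from the Python standard library; the survey responses are invented for illustration and are not data from the source:

```python
from collections import Counter

# Hypothetical survey responses: (age group, chocolate preference).
responses = [
    ("5-10 years", "Like"),
    ("5-10 years", "Like"),
    ("5-10 years", "Do not like"),
    ("10-15 years", "Like"),
    ("10-15 years", "Do not like"),
    ("10-15 years", "Do not like"),
    ("15-20 years", "Like"),
    ("15-20 years", "Like"),
]

# Each (row category, column category) pair is one cell of the table.
counts = Counter(responses)

age_groups = ["5-10 years", "10-15 years", "15-20 years"]
preferences = ["Like", "Do not like"]

# Print one line per age group, with the count for each preference.
for age in age_groups:
    cells = [counts[(age, pref)] for pref in preferences]
    print(age, cells)
```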

A two-way relative frequency table is similar to a two-way frequency table, but it displays percentages instead of counts.
Each cell represents the percentage of data points that fit in that category.
Row relative frequencies or column relative frequencies can be used, depending on the context of the problem.
It is helpful when the groups in a dataset have different sample sizes, because percentages make preferences easier to compare across groups.
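
As a rough sketch of row relative frequencies, each cell is divided by its row total. The counts below are hypothetical and chosen so that the groups have different sample sizes:

```python
# Hypothetical two-way frequency table: age group -> counts per preference.
counts = {
    "5-10 years": {"Like": 18, "Do not like": 2},
    "10-15 years": {"Like": 30, "Do not like": 20},
    "15-20 years": {"Like": 6, "Do not like": 4},
}

# Row relative frequency: each cell divided by its row total, as a percentage.
row_percentages = {}
for age, row in counts.items():
    total = sum(row.values())
    row_percentages[age] = {pref: 100 * n / total for pref, n in row.items()}

for age, row in row_percentages.items():
    print(age, row)
```

In this made-up data, 30 people aged 10-15 like chocolates against only 6 aged 15-20, yet both groups come out at 60%, which is the kind of comparison raw counts hide.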

Mean is a measure of central tendency, also known as the simple average.
It is the average value of a dataset, around which the entire data is spread out.
All values are weighted equally when calculating the mean.
Calculation: Sum of all values in the dataset divided by the number of values.
Example: Finding the mean of the integers from 10 to 20 (11 numbers).
- Sum = 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 = 165
- Number of values = 11
- Mean = $\frac{165}{11} = 15$
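
The same calculation can be reproduced in a few lines of Python (the variable names are illustrative):

```python
values = list(range(10, 21))  # the integers 10 through 20

total = sum(values)   # sum of all values
count = len(values)   # number of values
mean = total / count  # sum divided by count

print(total, count, mean)  # 165 11 15.0
```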

The median is another measure of central tendency, representing the middle point of a sorted dataset.
To calculate the median, the dataset must first be sorted in ascending or descending order; the median is then the exact middle value of the sorted set.
If the number of data points is odd, the median is the middle data point in the list.
- Example: Find the median of the given data set {1, 4, 2, 5, 0}
  - Put the data in ascending order: 0, 1, 2, 4, 5
  - Median = 2
If the number of data points is even, the median is the average of the two middle data points in the list.
- Example: Find the median of the given data set {10, 40, 20, 50}
  - Put the data in ascending order: 10, 20, 40, 50
  - The two middle data points are 20 and 40
  - Median = $\frac{20 + 40}{2} = \frac{60}{2} = 30$
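
Both cases can be captured in one small function. This is a minimal sketch that assumes a non-empty list of numbers; the function name `median` is chosen here for illustration (the standard library's `statistics.median` does the same job):

```python
def median(values):
    """Return the middle value of the data (assumes a non-empty list)."""
    ordered = sorted(values)  # put the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle data point.
        return ordered[mid]
    # Even count: the average of the two middle data points.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 4, 2, 5, 0]))   # 2
print(median([10, 40, 20, 50]))  # 30.0
```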

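The median is a more representative measure of central tendency than the mean when there are outliers or irregular values in the dataset, because extreme values pull the mean toward them but have little effect on the middle value.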
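
A quick way to see this is to compare the mean and median of a dataset with and without one extreme value. The numbers below are hypothetical:

```python
from statistics import mean, median

# Hypothetical data: the second list replaces 50 with an outlier of 500.
typical = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]

print(mean(typical), median(typical))
print(mean(with_outlier), median(with_outlier))
```

Here the single outlier moves the mean from 30 to 120, while the median stays at 30.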

Mean Absolute Deviation (MAD) is the average distance of the values in a dataset from the mean.
Steps to calculate MAD:
1. Calculate the mean of the dataset.
2. Calculate the distance of each data point from the mean and take the absolute value (ignore the negative sign).
3. Calculate the mean of the distances.
MAD provides a good understanding of the variability or scatter of the data set.

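Example: Finding the MAD of the marks of three students.

Student Name	Marks (%)	Mean	Deviation	Absolute Deviation
Student 1	85	80	85 - 80 = 5	5
Student 2	75	80	75 - 80 = -5	5
Student 3	80	80	80 - 80 = 0	0

MAD = $\frac{5 + 5 + 0}{3} = \frac{10}{3} \approx 3.33$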
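
The three MAD steps can be sketched in Python using the marks from the table (variable names are illustrative):

```python
marks = [85, 75, 80]

# Step 1: mean of the dataset.
mean = sum(marks) / len(marks)

# Step 2: absolute distance of each data point from the mean.
absolute_deviations = [abs(m - mean) for m in marks]

# Step 3: mean of those distances.
mad = sum(absolute_deviations) / len(absolute_deviations)

print(mean, absolute_deviations, round(mad, 2))  # 80.0 [5.0, 5.0, 0.0] 3.33
```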

Standard Deviation is a measure of how spread out the numbers are around the mean or average.
Steps to calculate standard deviation:
1. Calculate the mean of the dataset.
2. Subtract the mean from each value.
3. Square each of the differences.
4. Find the average of the squared numbers to calculate the variance.
5. Find the square root of the variance.
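
In symbols, assuming the dataset is treated as a whole population of $N$ values $x_1, \dots, x_N$ with mean $\mu$ (this notation is introduced here and does not appear in the original), the steps above amount to:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}$

Note that when the data is only a sample from a larger population, the sum is commonly divided by $N - 1$ instead of $N$; the steps above describe the population form.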

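Example: Find the standard deviation of the dataset 3, 2, 5, and 6.

Step 1: Find the mean
- $\frac{3 + 2 + 5 + 6}{4} = \frac{16}{4} = 4$
Steps 2 and 3: Subtract the mean from each value and square each difference

Values	Mean	Difference	Squared Difference
3	4	3-4 = -1	$(-1)^2 = 1$
2	4	2-4 = -2	$(-2)^2 = 4$
5	4	5-4 = 1	$(1)^2 = 1$
6	4	6-4 = 2	$(2)^2 = 4$

Step 4: Find the variance (the average of the squared differences)
- $\frac{1 + 4 + 1 + 4}{4} = \frac{10}{4} = 2.5$
Step 5: Find the standard deviation (the square root of the variance)
- $\sqrt{2.5} \approx 1.58$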
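
The worked example can be checked with a short Python sketch. It follows the steps listed earlier and divides by the number of values, that is, it treats the data as a full population; the variable names are illustrative:

```python
import math

values = [3, 2, 5, 6]

# Step 1: mean of the dataset.
mean = sum(values) / len(values)

# Steps 2 and 3: subtract the mean from each value and square the difference.
squared_differences = [(v - mean) ** 2 for v in values]

# Step 4: variance is the average of the squared differences.
variance = sum(squared_differences) / len(values)

# Step 5: standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 2))  # 4.0 2.5 1.58
```

The standard library's `statistics.pstdev` computes the same population value, whereas `statistics.stdev` divides by one less than the number of values and would give a larger result.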

Graphically, standard deviation is depicted as the width of a bell curve around the mean of a dataset.
A wider curve indicates a larger standard deviation.

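Standard deviation is applied in areas such as the following.
Grading Tests: Determine whether students perform at a similar level (low standard deviation) or whether their scores vary widely (high standard deviation).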
Calculating Survey Results: Measure the reliability of responses and predict how a bigger group may answer.
Weather Forecasting: Analyze the reliability of low-temperature forecasts for different cities. A low standard deviation indicates a reliable forecast.
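
As an illustration of the forecasting case, one possible reading is to compare the standard deviation of forecast errors for two cities. Everything in the sketch below is an assumption: the city names, the error values (forecast low minus observed low, in degrees), and the choice to measure reliability through forecast errors.

```python
from statistics import pstdev

# Hypothetical forecast errors over five days for two made-up cities.
city_a_errors = [1, -1, 0, 1, -1]
city_b_errors = [5, -6, 2, -4, 3]

spread_a = pstdev(city_a_errors)  # population standard deviation
spread_b = pstdev(city_b_errors)

# The smaller spread suggests the more consistent (reliable) forecast.
more_reliable = "City A" if spread_a < spread_b else "City B"
print(round(spread_a, 2), round(spread_b, 2), more_reliable)
```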