Data Science - Chapter 1: Use of Statistics in Data Science
Subset
- A subset is a smaller set of data taken from a larger dataset for analysis.
- Subsetting is a useful indexing feature for accessing object elements, selecting variables, and filtering observations.
- Subsetting allows focusing on the required data by filtering out unnecessary content.
- Example: Selecting the first 5 rows and 5 columns from a table of 100 rows and 100 columns to create a subset.
Different Ways to Subset Data
- Row-based subsetting: Taking some rows from the top or bottom of the table. Example: Selecting the top 3 rows from a table of 6 rows and 4 columns.
- Column-based subsetting: Selecting specific columns from the dataset. This is useful when the original dataset contains a large number of columns, and not all are necessary for analysis.
- Data-based subsetting: Subsetting the data based on specific data values or conditions.
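The three approaches above can be sketched in plain Python. This is only an illustration: the table contents and the column names (`name`, `age`, `city`, `score`) are hypothetical, and the threshold of 80 is an arbitrary condition chosen for the example.

```python
# Hypothetical table with 6 rows and 4 columns, stored as a list of dictionaries.
table = [
    {"name": "A", "age": 12, "city": "Pune", "score": 71},
    {"name": "B", "age": 14, "city": "Delhi", "score": 85},
    {"name": "C", "age": 13, "city": "Pune", "score": 64},
    {"name": "D", "age": 15, "city": "Mumbai", "score": 90},
    {"name": "E", "age": 12, "city": "Delhi", "score": 78},
    {"name": "F", "age": 14, "city": "Mumbai", "score": 88},
]

# Row-based subsetting: take the top 3 rows.
top_rows = table[:3]

# Column-based subsetting: keep only the columns needed for the analysis.
wanted_columns = ["name", "score"]
column_subset = [{col: row[col] for col in wanted_columns} for row in table]

# Data-based subsetting: keep only the rows whose values satisfy a condition.
high_scores = [row for row in table if row["score"] >= 80]

print(len(top_rows), list(column_subset[0]), len(high_scores))
```

In practice a data-frame library would usually do this work, but the list slices and comprehensions show the idea without extra dependencies.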
Two-Way Frequency Table
- A two-way table is a statistical table that displays the observed counts (frequencies) for two categorical variables.
- Rows represent the categories of one variable, and columns represent the categories of the other, showing how many data points fit in each combination.
- Example: A table showing the relationship between age groups ("5-10 years", "10-15 years", "15-20 years") and their preference for chocolates ("Like chocolates" or "Do not like chocolates").
- Each cell in the table represents the number (or frequency) of people in that specific category.
Interpreting Two-Way Tables
- Categories are listed in the left column and top row.
- Counts are placed in the center of the table.
- Totals are at the end of each row and column.
- The sum of all counts (total) is placed at the bottom right.
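As a rough sketch, a two-way frequency table with its totals can be built from raw observations as follows. The age groups and preference labels come from the chocolate example; the individual survey responses are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses: (age group, chocolate preference).
responses = [
    ("5-10 years", "Like chocolates"),
    ("5-10 years", "Like chocolates"),
    ("5-10 years", "Do not like chocolates"),
    ("10-15 years", "Like chocolates"),
    ("10-15 years", "Do not like chocolates"),
    ("15-20 years", "Like chocolates"),
    ("15-20 years", "Do not like chocolates"),
    ("15-20 years", "Do not like chocolates"),
]

row_labels = ["5-10 years", "10-15 years", "15-20 years"]
col_labels = ["Like chocolates", "Do not like chocolates"]

# Each cell counts the responses that fall in one (row, column) combination.
counts = Counter(responses)

# Row totals go at the end of each row, column totals at the end of each column.
row_totals = {r: sum(counts[(r, c)] for c in col_labels) for r in row_labels}
col_totals = {c: sum(counts[(r, c)] for r in row_labels) for c in col_labels}

# The grand total sits at the bottom right of the table.
grand_total = sum(row_totals.values())

for r in row_labels:
    print(r, [counts[(r, c)] for c in col_labels], row_totals[r])
print("Total", [col_totals[c] for c in col_labels], grand_total)
```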
Two-Way Relative Frequency Table
- Similar to a two-way frequency table, but it displays percentages instead of counts.
- Represents the percentage of data points that fit in each category.
- Row relative frequencies or column relative frequencies can be used, depending on the context of the problem.
- Helpful when there are different sample sizes in a dataset because percentages make it easier to compare preferences.
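A minimal sketch of row relative frequencies, assuming the counts are already available. The counts below are hypothetical, and `row_relative_frequencies` is a helper name introduced only for this example.

```python
# Hypothetical two-way frequency table: counts[age group][preference].
counts = {
    "5-10 years": {"Like chocolates": 18, "Do not like chocolates": 2},
    "10-15 years": {"Like chocolates": 30, "Do not like chocolates": 20},
}


def row_relative_frequencies(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {
            col_label: 100 * count / row_total for col_label, count in row.items()
        }
    return result


percentages = row_relative_frequencies(counts)
for row_label, row in percentages.items():
    print(row_label, row)
```

With these made-up counts, more children in the 10-15 group like chocolates in absolute terms (30 against 18), but the share is higher in the 5-10 group (90% against 60%). This is the kind of comparison across unequal sample sizes that percentages make easier.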
Mean
- Mean is a measure of central tendency, also known as the simple average.
- It is the average value of a dataset, around which the entire data is spread out.
- All values are weighted equally when calculating the mean.
- Calculation: Sum of all values in the dataset divided by the number of values.
- Example: Finding the mean of the numbers 10 to 20 (11 numbers).
- Sum = 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 = 165
- Number of values = 11
- Mean = 165 / 11 = 15
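The same calculation can be checked with a few lines of Python. The variable names are chosen only for this sketch.

```python
values = list(range(10, 21))  # the numbers 10 to 20 inclusive

total = sum(values)   # sum of all values
count = len(values)   # number of values
mean = total / count  # every value carries equal weight

print(total, count, mean)  # 165 11 15.0
```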
Median
- The median is another measure of central tendency, representing the middle point of a sorted dataset.
- To calculate the median, the dataset must be ordered in ascending or descending order.
- When the dataset is sorted from the smallest value to the largest, the value in the exact middle is the median.
- If the number of data points is odd, the median is the middle data point in the list.
- Example: Find the median of the given data set {1, 4, 2, 5, 0}
- Put the data in ascending order: 0, 1, 2, 4, 5
- Median = 2
- If the number of data points is even, the median is the average of the two middle data points in the list.
- Example: Find the median of the given data set {10, 40, 20, 50}
- Put the data in ascending order: 10, 20, 40, 50
- The two middle data points are 20 and 40
- Median = (20 + 40) / 2 = 30
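The odd and even cases can be sketched in one small function. The function below is written out for illustration; Python's standard library offers `statistics.median`, which handles both cases as well.

```python
def median(values):
    """Return the middle value of the sorted data.

    For an even number of values, return the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle data point exists.
        return ordered[middle]
    # Even count: average the two middle data points.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 4, 2, 5, 0]))   # 2
print(median([10, 40, 20, 50]))  # 30.0
```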
Mean vs. Median
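- The median is a more robust measure of central tendency than the mean when the dataset contains outliers or irregular values, because a few extreme values pull the mean toward them while the median stays at the middle of the data.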
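A small sketch of this effect, using hypothetical values in which the last entry is an outlier:

```python
from statistics import mean, median

# Hypothetical values; 100 is added as an outlier.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]

# Without the outlier, mean and median are close together.
print(mean(values), median(values))

# With the outlier, the mean jumps while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

Here the mean rises from 11.6 to a little over 26 once the outlier is included, while the median remains 12.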
Mean Absolute Deviation
- Mean Absolute Deviation (MAD) is the average of how far away all values in a data set are from the mean.
- Steps to calculate MAD:
- Calculate the mean of the dataset.
- Calculate the distance of each data point from the mean and take the absolute value (ignore the negative sign).
- Calculate the mean of the distances.
- MAD provides a good understanding of the variability or scatter of the data set.
Example
Dataset: Maximum marks (in percentage) of three students: 85, 75, 80.
- Find the mean.
- Mean = (85 + 75 + 80) / 3 = 240 / 3 = 80
- Calculate the distance of each data point from the mean and take the absolute value. For example, if the distance is -2, ignore the negative sign and use 2.
| Student Name | Maximum Marks (%) | Mean | Deviation | Absolute Deviation |
|---|---|---|---|---|
| Student 1 | 85 | 80 | 85 - 80 = 5 | 5 |
| Student 2 | 75 | 80 | 75 - 80 = -5 | 5 |
| Student 3 | 80 | 80 | 80 - 80 = 0 | 0 |
- Calculate the mean of the distances.
- Mean of distances = (5 + 5 + 0) / 3 = 10 / 3 ≈ 3.33
- MAD = 3.33
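The three MAD steps can be sketched in Python using the marks from the example. The function name `mean_absolute_deviation` is chosen for this illustration.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mean = sum(values) / len(values)             # step 1: mean of the dataset
    distances = [abs(v - mean) for v in values]  # step 2: absolute distances
    return sum(distances) / len(distances)       # step 3: mean of the distances


marks = [85, 75, 80]
mad = mean_absolute_deviation(marks)
print(round(mad, 2))  # 3.33
```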
Standard Deviation
- Standard Deviation is a measure of how spread out the numbers are around the mean or average.
- Steps to calculate standard deviation:
- Calculate the mean of the dataset.
- Subtract the mean from each value.
- Square each of the differences.
- Find the average of the squared differences to calculate the variance.
- Find the square root of the variance.
Example
Dataset: 3, 2, 5, and 6.
- Step 1: Find the mean
- Mean = (3 + 2 + 5 + 6) / 4 = 16 / 4 = 4
- Step 2: Find the squared differences from the mean
| Values | Mean | Difference | Squared Differences |
|---|---|---|---|
| 3 | 4 | 3-4 = -1 | (-1)² = 1 |
| 2 | 4 | 2-4 = -2 | (-2)² = 4 |
| 5 | 4 | 5-4 = 1 | 1² = 1 |
| 6 | 4 | 6-4 = 2 | 2² = 4 |
- Step 3: Find the variance
- Variance = (1 + 4 + 1 + 4) / 4 = 10 / 4 = 2.5
- Step 4: Find the standard deviation
- Standard deviation = √2.5 ≈ 1.58
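The steps can be reproduced in Python as a check on the arithmetic. This sketch divides by the number of values, as the steps above do, which is the population form of the variance.

```python
import math
import statistics

values = [3, 2, 5, 6]

mean = sum(values) / len(values)                   # Step 1: mean
squared_diffs = [(v - mean) ** 2 for v in values]  # Step 2: squared differences
variance = sum(squared_diffs) / len(values)        # Step 3: average of the squares
std_dev = math.sqrt(variance)                      # Step 4: square root of the variance

print(mean, variance, round(std_dev, 2))  # 4.0 2.5 1.58

# statistics.pstdev uses the same divide-by-n (population) definition.
print(math.isclose(std_dev, statistics.pstdev(values)))  # True
```

Note that `statistics.stdev` divides by n - 1 instead (the sample form) and would give about 1.83 for the same data, so the two functions should not be mixed up when checking results against these notes.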
Graph Representation
- To represent standard deviation on a graph:
- Draw an empty graph (x and y axis).
- Plot the mean as the center value (e.g., 3.8).
- Calculate points on the x-axis by adding or subtracting multiples of the standard deviation (SD) from the mean, for example mean ± 1 SD, mean ± 2 SD, and mean ± 3 SD.
- Draw a bell curve and mark standard deviations.
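The x-axis positions for the marks can be computed with a short helper. This is a sketch under two assumptions: it reuses the mean of 4 and standard deviation of 1.58 from the worked example, and it marks up to three standard deviations on each side, which is a common but arbitrary choice. The name `sd_marks` is hypothetical.

```python
def sd_marks(mean, sd, max_multiple=3):
    """Return x-axis positions at mean + k * sd for k = -max_multiple..max_multiple."""
    return [mean + k * sd for k in range(-max_multiple, max_multiple + 1)]


marks = sd_marks(mean=4, sd=1.58)
print([round(x, 2) for x in marks])
```

The middle entry of the returned list is the mean itself, and the entries on either side are one, two, and three standard deviations away from it.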
Explaining Standard Deviation on a Graph
- Graphically, standard deviation is depicted as the width of a bell curve around the mean of a dataset.
- A wider curve indicates a larger standard deviation.
Real-Life Implementations of Standard Deviation
- Grading Tests: Determine whether students perform at a similar level (low standard deviation) or whether their scores vary widely (high standard deviation).
- Calculating Survey Results: Measure the reliability of responses and predict how a bigger group may answer.
- Weather Forecasting: Analyze the reliability of low-temperature forecasts for different cities. A low standard deviation indicates a reliable forecast.