Data Science - Chapter 1: Use of Statistics in Data Science

Subset

  • A subset is a smaller set of data taken from a larger dataset for analysis.
  • Subsetting is an indexing feature used to access object elements, select variables, and filter observations.
  • Subsetting allows focusing on the required data by filtering out unnecessary content.
  • Example: Selecting the first 5 rows and 5 columns from a table of 100 rows and 100 columns to create a subset.

Different Ways to Subset Data

  1. Row-based subsetting: Taking some rows from the top or bottom of the table. Example: Selecting the top 3 rows from a table of 6 rows and 4 columns.
  2. Column-based subsetting: Selecting specific columns from the dataset. This is useful when the original dataset contains a large number of columns, and not all are necessary for analysis.
  3. Data-based subsetting: Subsetting the data based on specific data values or conditions.
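
The three approaches above can be sketched in plain Python. This is only an illustration: the table, its column names (`id`, `age`, `city`, `score`), and all of its values are hypothetical, and a data-frame library would normally be used for larger datasets.

```python
# Hypothetical 6-row, 4-column table; every value is made up for illustration.
table = [
    {"id": 1, "age": 7, "city": "Pune", "score": 62},
    {"id": 2, "age": 12, "city": "Delhi", "score": 75},
    {"id": 3, "age": 16, "city": "Pune", "score": 88},
    {"id": 4, "age": 9, "city": "Mumbai", "score": 54},
    {"id": 5, "age": 18, "city": "Delhi", "score": 91},
    {"id": 6, "age": 14, "city": "Mumbai", "score": 67},
]

# 1. Row-based subsetting: keep the top 3 rows.
top_rows = table[:3]

# 2. Column-based subsetting: keep only the "id" and "score" columns.
wanted_columns = ["id", "score"]
id_and_score = [{col: row[col] for col in wanted_columns} for row in table]

# 3. Data-based subsetting: keep rows that satisfy a condition on the values.
high_scores = [row for row in table if row["score"] >= 75]

print(len(top_rows))                       # 3
print(id_and_score[0])                     # {'id': 1, 'score': 62}
print([row["id"] for row in high_scores])  # [2, 3, 5]
```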

Two-Way Frequency Table

  • A two-way table is a statistical table that displays the observed number or frequency for two variables.
  • Rows indicate one category, and columns indicate the other category, showing how many data points fit in each category.
  • Example: A table showing the relationship between age groups ("5-10 years", "10-15 years", "15-20 years") and their preference for chocolates ("Like chocolates" or "Do not like chocolates").
  • Each cell in the table represents the number (or frequency) of people in that specific category.

Interpreting Two-Way Tables

  • Categories are listed in the left column and top row.
  • Counts are placed in the center of the table.
  • Totals are at the end of each row and column.
  • The sum of all counts (total) is placed at the bottom right.
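
As a rough sketch, a two-way frequency table with its row totals, column totals, and grand total can be built with `collections.Counter`. The survey responses below are hypothetical and exist only to show the layout.

```python
from collections import Counter

# Hypothetical survey responses as (age group, preference) pairs.
responses = [
    ("5-10 years", "Like"),
    ("5-10 years", "Like"),
    ("5-10 years", "Do not like"),
    ("10-15 years", "Like"),
    ("10-15 years", "Do not like"),
    ("15-20 years", "Like"),
    ("15-20 years", "Do not like"),
    ("15-20 years", "Do not like"),
]

row_labels = ["5-10 years", "10-15 years", "15-20 years"]
col_labels = ["Like", "Do not like"]

# Each cell counts the responses that fall in one (row, column) pair.
counts = Counter(responses)

# Totals at the end of each row and column, plus the grand total.
row_totals = {r: sum(counts[(r, c)] for c in col_labels) for r in row_labels}
col_totals = {c: sum(counts[(r, c)] for r in row_labels) for c in col_labels}
grand_total = sum(row_totals.values())

for r in row_labels:
    print(r, [counts[(r, c)] for c in col_labels], row_totals[r])
print("Total", [col_totals[c] for c in col_labels], grand_total)
```

The grand total printed in the last position corresponds to the bottom-right cell of the table.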

Two-Way Relative Frequency Table

  • Similar to a two-way frequency table, but it displays percentages (relative frequencies) instead of raw counts.
  • Represents the percentage of data points that fit in each category.
  • Row relative frequencies or column relative frequencies can be used, depending on the context of the problem.
  • Helpful when the groups being compared have different sample sizes, because percentages put them on a common scale and make preferences easier to compare.
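
A minimal sketch of row relative frequencies follows. The counts are hypothetical, and `row_relative_frequencies` is an illustrative helper name, not a library function. Each count is divided by its own row total, so groups of different sizes become comparable.

```python
# Hypothetical two-way counts: rows are age groups, columns are preferences.
counts = {
    "5-10 years": {"Like": 8, "Do not like": 2},     # 10 respondents
    "10-15 years": {"Like": 15, "Do not like": 5},   # 20 respondents
    "15-20 years": {"Like": 10, "Do not like": 10},  # 20 respondents
}


def row_relative_frequencies(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {
            col_label: 100 * count / row_total for col_label, count in row.items()
        }
    return result


row_percentages = row_relative_frequencies(counts)
for row_label, row in row_percentages.items():
    print(row_label, row)
```

Column relative frequencies would follow the same pattern, dividing each count by its column total instead.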

Mean

  • Mean is a measure of central tendency, also known as the simple average.
  • It is the average value of a dataset, around which the entire data is spread out.
  • All values are weighted equally when calculating the mean.
  • Calculation: Sum of all values in the dataset divided by the number of values.
  • Example: Finding the mean of the integers from 10 to 20 (11 numbers).
    • Sum = 10 + 11 + 12 + 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 = 165
    • Number of values = 11
    • Mean = $\frac{165}{11} = 15$
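
The same calculation can be checked with a few lines of Python; the variable names are illustrative only.

```python
# The integers from 10 to 20 inclusive (11 values).
values = list(range(10, 21))

total = sum(values)               # 165
mean_value = total / len(values)  # sum divided by the number of values

print(total, len(values), mean_value)  # 165 11 15.0
```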

Median

  • The median is another measure of central tendency, representing the middle point of a sorted dataset.
  • To calculate the median, the dataset must be ordered in ascending or descending order.
  • Once the dataset is sorted from smallest to largest, the value exactly in the middle is the median.
  • If the number of data points is odd, the median is the middle data point in the list.
    • Example: Find the median of the given data set {1, 4, 2, 5, 0}
      • Put the data in ascending order: 0, 1, 2, 4, 5
      • Median = 2
  • If the number of data points is even, the median is the average of the two middle data points in the list.
    • Example: Find the median of the given data set {10, 40, 20, 50}
      • Put the data in ascending order: 10, 20, 40, 50
      • The two middle data points are 20 and 40
      • Median = $\frac{20 + 40}{2} = \frac{60}{2} = 30$
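
The odd and even cases can be sketched as one small function. The name `median` is an illustrative choice here; Python's standard library also provides `statistics.median`, which behaves the same way for these inputs.

```python
def median(values):
    """Return the middle value of the data, averaging the two middle values if needed."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)  # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of data points: take the single middle value.
        return ordered[mid]
    # Even number of data points: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 4, 2, 5, 0]))    # 2
print(median([10, 40, 20, 50]))   # 30.0
```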

Mean vs. Median

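  • The median is a more representative measure of central tendency when the dataset contains outliers or irregular values, because a few extreme values pull the mean toward them but leave the middle value largely unchanged.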
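
To illustrate, the sketch below adds a single extreme value to a small dataset and compares how the two measures react. The numbers are hypothetical.

```python
import statistics

# Hypothetical dataset without outliers.
data = [30, 32, 35, 38, 40]
# The same dataset with one extreme value added.
data_with_outlier = data + [400]

mean_before = statistics.mean(data)                    # 35
median_before = statistics.median(data)                # 35
mean_after = statistics.mean(data_with_outlier)        # about 95.83
median_after = statistics.median(data_with_outlier)    # 36.5

print(mean_before, median_before)
print(round(mean_after, 2), median_after)
```

The mean moves far from most of the data, while the median shifts only slightly.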

Mean Absolute Deviation

  • Mean Absolute Deviation (MAD) is the average of how far away all values in a data set are from the mean.
  • Steps to calculate MAD:
    1. Calculate the mean of the dataset.
    2. Calculate the distance of each data point from the mean and take the absolute value (ignore the negative sign).
    3. Calculate the mean of the distances.
  • MAD provides a good understanding of the variability or scatter of the data set.

Example

  • Dataset: Maximum marks (in percentage) of three students: 85, 75, 80.

    1. Find the Mean
      $\frac{85 + 75 + 80}{3} = 80$
    2. Calculate the distance of each data point from the mean and take its absolute value. For example, if the distance is -2, ignore the negative sign and use 2.
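    | Student Name | Maximum Marks (%) | Mean | Deviation    | Absolute Deviation |
    |--------------|-------------------|------|--------------|--------------------|
    | Student 1    | 85                | 80   | 85 - 80 = 5  | 5                  |
    | Student 2    | 75                | 80   | 75 - 80 = -5 | 5                  |
    | Student 3    | 80                | 80   | 80 - 80 = 0  | 0                  |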
    • Mean of distances = $\frac{5 + 5 + 0}{3} \approx 3.33$
    • MAD = 3.33
    • The mean is 80, so on average each student's marks lie about 3.33 percentage points away from it.
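
The worked example can be reproduced with a short Python sketch; the variable names are illustrative.

```python
# Maximum marks (in percentage) of the three students.
marks = [85, 75, 80]

# Step 1: mean of the dataset.
mean_marks = sum(marks) / len(marks)  # 80.0

# Step 2: absolute distance of each data point from the mean.
absolute_deviations = [abs(m - mean_marks) for m in marks]  # [5.0, 5.0, 0.0]

# Step 3: mean of those distances.
mad = sum(absolute_deviations) / len(absolute_deviations)

print(mean_marks, absolute_deviations, round(mad, 2))  # 80.0 [5.0, 5.0, 0.0] 3.33
```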

Standard Deviation

  • Standard Deviation is a measure of how spread out the numbers are around the mean or average.
  • Steps to calculate standard deviation:
    1. Calculate the mean of the dataset.
    2. Subtract the mean from each value.
    3. Square each of the differences.
    4. Find the average of the squared differences to calculate the variance.
    5. Find the square root of the variance.

Example

Dataset: 3, 2, 5, and 6.

  • Step 1: Find the mean
    • $\frac{3 + 2 + 5 + 6}{4} = \frac{16}{4} = 4$
  • Step 2: Find the squared differences from the mean and add them up
    • $(3-4)^2 + (2-4)^2 + (5-4)^2 + (6-4)^2 = 1 + 4 + 1 + 4 = 10$

    | Value | Mean | Difference | Squared Difference |
    |-------|------|------------|--------------------|
    | 3     | 4    | 3 - 4 = -1 | $(-1)^2 = 1$       |
    | 2     | 4    | 2 - 4 = -2 | $(-2)^2 = 4$       |
    | 5     | 4    | 5 - 4 = 1  | $(1)^2 = 1$        |
    | 6     | 4    | 6 - 4 = 2  | $(2)^2 = 4$        |

  • Step 3: Find Variance
    • Variance = $\frac{10}{4} = 2.5$
  • Step 4: Standard deviation
    • $\sqrt{2.5} \approx 1.58$
  • Standard deviation ≈ 1.58
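
The steps of this example can be sketched in Python as follows. Because the example divides by the number of values (4), it matches the population standard deviation, which the standard library exposes as `statistics.pstdev`. The sample version, `statistics.stdev`, divides by one less than the number of values and would give a different result.

```python
import math
import statistics

values = [3, 2, 5, 6]

# Step 1: mean of the dataset.
mean_value = sum(values) / len(values)  # 4.0

# Step 2: squared difference of each value from the mean.
squared_differences = [(v - mean_value) ** 2 for v in values]  # [1.0, 4.0, 1.0, 4.0]

# Step 3: variance is the average of the squared differences.
variance = sum(squared_differences) / len(values)  # 2.5

# Step 4: standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(variance, round(std_dev, 2))            # 2.5 1.58
print(round(statistics.pstdev(values), 2))    # 1.58
```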

Graph Representation

  • To represent standard deviation on a graph:
    1. Draw an empty graph (x and y axis).
    2. Plot the mean as the center value (e.g., 3.8).
    3. Calculate points on the x-axis by adding/subtracting multiples of the standard deviation (SD) from the mean (here, SD = 2.48):
      • $+1\,\text{SD} = 3.8 + 2.48 = 6.28$
      • $+2\,\text{SD} = 3.8 + 2(2.48) = 8.76$
      • $-1\,\text{SD} = 3.8 - 2.48 = 1.32$
      • $-2\,\text{SD} = 3.8 - 2(2.48) = -1.16$
    4. Draw a bell curve and mark standard deviations.
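
The x-axis positions used in step 3 can be computed with a short sketch. It uses the mean (3.8) and standard deviation (2.48) from the example above; the dictionary name `sd_marks` is illustrative, and drawing the curve itself is left to a plotting library.

```python
mean_value = 3.8
sd = 2.48

# Position of each mark: mean + k * SD for k = -2, -1, +1, +2.
sd_marks = {k: round(mean_value + k * sd, 2) for k in (-2, -1, 1, 2)}

for k, position in sd_marks.items():
    print(f"{k:+d} SD -> {position}")
```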

Explaining Standard Deviation on a Graph

  • Graphically, standard deviation is depicted as the width of a bell curve around the mean of a dataset.
  • A wider curve indicates a larger standard deviation.
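
As a sketch of why a larger standard deviation gives a wider curve, the example below compares two bell curves with the same mean. It assumes the data are normally distributed and uses `statistics.NormalDist` from the standard library (Python 3.8 or later). The means and standard deviations are hypothetical.

```python
from statistics import NormalDist

# Assumption: both datasets follow a normal (bell-shaped) distribution.
narrow = NormalDist(mu=0, sigma=1)  # smaller standard deviation
wide = NormalDist(mu=0, sigma=2)    # larger standard deviation

# Height of each curve at the mean: the wider curve is flatter.
peak_narrow = narrow.pdf(0)  # about 0.399
peak_wide = wide.pdf(0)      # about 0.199

# Share of values lying within one unit of the mean.
share_narrow = narrow.cdf(1) - narrow.cdf(-1)  # about 0.683
share_wide = wide.cdf(1) - wide.cdf(-1)        # about 0.383

print(round(peak_narrow, 3), round(peak_wide, 3))
print(round(share_narrow, 3), round(share_wide, 3))
```

With the larger standard deviation, fewer values sit close to the mean, so the curve is lower at the center and spreads further along the x-axis.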

Real-Life Implementations of Standard Deviation

  1. Grading Tests: Determine whether students perform at a similar level (low standard deviation) or whether their scores vary widely (high standard deviation).
  2. Calculating Survey Results: Measure the reliability of responses and predict how a bigger group may answer.
  3. Weather Forecasting: Analyze the reliability of low-temperature forecasts for different cities. A low standard deviation indicates a reliable forecast.