Introduction to Descriptive Statistics

Introduction to Statistics

  • Learning Outcomes:
    • Understand statistics.
    • Identify types of statistics.
    • Explain descriptive statistics.
    • Use SPSS to measure descriptive statistics.

What is Statistics?

  • Statistics involves:
    • Collecting data.
    • Describing data.
    • Interpreting data to make inferences and draw conclusions.

Who Uses Statistics?

  • Statistical techniques are used extensively by:
    • Marketing professionals
    • Accountants
    • Quality control experts
    • Consumers
    • Professional sports people
    • Hospital administrators
    • Educators
    • Politicians
    • Researchers
    • Students

Basic Terms

  • Population: A collection or set of individuals, objects, or events whose properties are analyzed.
  • Sample: A subset of a population, representing it through sampling techniques.
  • Variable (or response variable): A characteristic of interest about each individual element of a population or sample.
  • Data: The set of values collected for a variable from each element in the sample.

Scale of Measurement / Classification of Variables

  • Categorical: Describes or categorizes an element of a population.
    • Nominal
    • Ordinal
  • Numerical: Quantifies an element of a population.
    • Interval
    • Ratio

Categorical Variable Types

  • Nominal Variable:
    • Classifies characteristics into categories.
    • Data is mutually exclusive and not in rank order.
    • Examples:
      • Gender, races
    • Dichotomous variable examples:
      • Patient status (1 = alive, 2 = dead)
      • Blood Pressure (1 = high, 2 = low)
  • Ordinal Variable:
    • Incorporates an ordered position or ranking.
    • Differences/distances between ranks are not defined (the gaps between ranks need not be equal).
    • Mutually exclusive.
    • Examples:
      • Socioeconomic Status (1 = Low, 2 = Intermediate, 3 = High)
      • Attitude Scale (1 = Strongly Agree, 2 = Agree, 3 = Neutral, 4 = Disagree, 5 = Strongly Disagree)

Numerical Variable Types

  • Interval Variable:
    • Quantitative scale variables (discrete or continuous).
      • Discrete variable: A quantitative variable that can assume a countable number of values (gaps between values).
      • Continuous variable: A quantitative variable that can assume an uncountable number of values (with decimal values).
    • Zero point is arbitrary.
    • Able to add or subtract.
    • Examples:
      • Continuous variable: Temperature scale (37.5^\circ C, 38.2^\circ C, etc.)
      • Discrete variable: No. of children (1, 7, 10, etc.)
  • Ratio Variable:
    • Very similar to interval but zero point is not arbitrary.
    • Able to multiply or divide the values.
    • Examples:
      • Temperature in Kelvin scale (0 point is physically zero).
      • Blood pressure (120 mmHg / 80 mmHg).
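
The difference between an arbitrary zero (interval) and a true zero (ratio) can be sketched with a short calculation. The two temperature readings below are illustrative values, not data from these notes, and `celsius_to_kelvin` is a helper introduced only for this example.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius to Kelvin, which moves the zero point to absolute zero."""
    return celsius + 273.15


cool_c, warm_c = 20.0, 40.0  # illustrative readings (assumed, not from the notes)

# Differences are meaningful on an interval scale: both scales agree.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are meaningful only on a ratio scale (true zero).
ratio_c = warm_c / cool_c  # 2.0, but 40 °C is not "twice as hot" as 20 °C
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference: {diff_c:.1f} °C, {diff_k:.1f} K")
print(f"ratio: {ratio_c:.3f} (Celsius), {ratio_k:.3f} (Kelvin)")
```

The subtraction gives the same answer on both scales, while the division only has a physical meaning on the Kelvin scale.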

Study Variable

  1. Dependent or Outcome Variable:
    • An outcome whose variation the study seeks to describe, explain, or account for by the influence of independent or explanatory variables.
  2. Independent Variable / Explanatory Variable:
    • The variable hypothesized to influence the outcome variable under study; the hypothetical causal variable.
    • Examples:
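      • Difference between football players and basketball players in relation to their leg power score (independent: type of sport; dependent: leg power score).
      • The difference in job satisfaction between years of working experience (independent: years of working experience; dependent: job satisfaction).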

Classification of Statistics

  1. Descriptive Statistics:
    • Describe what happened in a particular study.
    • A collection, presentation, and description of sample data.
    • Examples: tables, graphs, etc.
  2. Inferential Statistics:
    • Draw conclusions about what those results mean in some broader context.
    • Techniques of interpreting values from descriptive techniques, making decisions, and drawing conclusions about the population.
    • Allow researchers to generalize characteristics of a population from the observed characteristics of a sample.

Descriptive Statistics

  • Categorical Variable

    • Frequency
    • Bar graph/chart
  • Numerical Variable

    • Measure of Central Tendency (mean, median, mode)
    • Measure of Variability (variance, standard deviation, range, outlier)
    • Measure group position (quartile, interquartile range, percentiles and percentile ranks, standard score [Z-score])
    • Graphical presentation

Frequency Table

  • A listing, often expressed in chart form, that pairs each value of a variable with its frequency.
  • Tables organize data into values and categories with titles and captions.
  • A frequency table may include:
    • Categories
    • Frequency
    • Cumulative frequency
    • Relative frequency
    • Proportion (%)

Example of Frequency Table

  • Frequencies of Gender

    Gender    Counts    % of Total    Cumulative %
    Male      16        53.3%         53.3%
    Female    14        46.7%         100.0%

Frequency Table APA 7th Style

Table 1: Descriptive result of gender

Variable      Frequency (n)    Percentage (%)
Gender
  Male        16               53.3
  Female      14               46.7
  Total       30               100.0
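
As a rough sketch of how such a frequency table could be computed outside SPSS, the Python below rebuilds the gender table. The raw list is an assumption reconstructed from the counts shown (16 male, 14 female); the original observations are not given in these notes.

```python
from collections import Counter

# Assumed raw data, reconstructed from the counts in the table above.
genders = ["Male"] * 16 + ["Female"] * 14

counts = Counter(genders)
total = sum(counts.values())

rows = []
cumulative = 0.0
for category, n in counts.most_common():
    percent = 100 * n / total      # relative frequency as a percentage
    cumulative += percent          # running (cumulative) percentage
    rows.append((category, n, round(percent, 1), round(cumulative, 1)))

print(f"{'Gender':<8}{'n':>4}{'%':>8}{'Cum %':>8}")
for category, n, percent, cum in rows:
    print(f"{category:<8}{n:>4}{percent:>8.1f}{cum:>8.1f}")
```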

Bar Graph / Chart

  • Shows the amount of data that belongs to each category as proportionally sized rectangular areas.
  • Graphical presentation of frequency distribution of categorical data (nominal or ordinal).
  • Y-axis: Frequency or relative frequency.
  • Height represents frequency or percent.
  • X-axis: Categorical variables.
  • Bars separated by equal gaps.
  • Bars of equal width.

Qualities of an Excellent Graph

  • No distortion of the data
  • Represents large data sets concisely and coherently
  • Ideas and concepts clearly understood by the viewer
  • Display induces the viewer to address the substance of the data and not the form of the graph
  • Encourages the viewer to compare two or more variables

Measures of Central Tendency (Numerical Variables)

  1. Mean

    • Sample average.

    • Formula: \bar{x} = \frac{\sum x}{n}, where \bar{x} = sample mean, \sum x = summation of all x values, and n = sample size.

    • Sensitive to extreme values, where one data point could make a great change in the sample mean.

    • Example:

      • What is the mean of systolic blood pressure (SBP) among the cases below?

        x1 = 120, x2 = 80, x3 = 90, x4 = 100, x5 = 120, x6 = 110

      • Solution:

        • n = 6
        • \bar{x} = \frac{120+80+90+100+120+110}{6} = \frac{620}{6} = 103.33
  2. Median

    • Middle value or the 50th percentile of a set of ordered numbers/measurements.

    • When n is odd, the median is the [(n+1)/2]^{th} value in the ordered data.

    • When n is even, the median is the average of the two middle observations.

    • Median = mean in normally distributed data.

    • Not sensitive to extreme values.

    • Examples:

      • a) n is odd:

        • In the opening round of the Christmas basketball tournament, Slippery Ice went into a freeze in the final 5 minutes of the game to preserve a 68-64 victory over a very tough and talented team from Hard Rock College. The starting five for Slippery Ice scored 12, 7, 18, 9 and 6 points. Find the median.

        • Solution:

          • Arrange the observations in order: 6, 7, 9, 12, 18
          • n=5
          • Formula: median position = [(n+1)/2]^{th} = [(5+1)/2]^{th} = 3^{rd} value
          • median = 9
      • b) n is even:

        • A new brand of cigarettes called Wheeze has just become available to the public. The nicotine contents for a random sample of 6 of these cigarettes are 12.3, 18.1, 15.7, 16.9, 21.2, and 18.5 milligrams. Find the median.

        • Solution:

          • Arrange the observations in order: 12.3, 15.7, 16.9, 18.1, 18.5, 21.2
          • Median = average of the two middle observations = \frac{(16.9 + 18.1)}{2} = 17.50 milligrams
  3. Mode

    • Value which occurs most often or with the greatest frequency.

    • Less useful than the mean or median as a descriptive statistic.

    • It requires no calculation.

    • A data set can have more than one mode.

    • Example:

      • The donations received from the residents of Windsor Lake toward the American Cancer Society were recorded as follows: 3, 4, 5, 6, 7, 7, 7, 7, 8, 8, and 9 dollars. Find the mode.

      • Solution:

        • mode = occurs most often = 7 (4 times)
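
The three worked examples above can be checked with Python's standard `statistics` module. This is only a cross-check sketch; the variable names are chosen for illustration.

```python
import statistics

sbp = [120, 80, 90, 100, 120, 110]               # systolic blood pressure example
points = [12, 7, 18, 9, 6]                       # basketball example (n is odd)
nicotine = [12.3, 18.1, 15.7, 16.9, 21.2, 18.5]  # cigarette example (n is even)
donations = [3, 4, 5, 6, 7, 7, 7, 7, 8, 8, 9]    # donation example

mean_sbp = statistics.mean(sbp)                  # sum of values / n
median_points = statistics.median(points)        # middle value after sorting
median_nicotine = statistics.median(nicotine)    # average of the two middle values
modes = statistics.multimode(donations)          # list, since there can be several modes

print(f"mean SBP        = {mean_sbp:.2f}")
print(f"median points   = {median_points}")
print(f"median nicotine = {median_nicotine:.2f}")
print(f"mode(s)         = {modes}")
```

`multimode` is used instead of `mode` because it returns every value tied for the highest frequency, which matches the note that there can be more than one mode.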

Guidelines for Using Mean, Median, and Mode

  • The mean is a good summary for values that represent magnitudes, like test marks and the cost of something.
  • The median is best used for ranked data, such as ordering people by height, or when extreme values might distort the mean.
  • The mode is best used when finding out the most popular dress size or the most popular brand of chocolate.

Hands-on example

  • Find the mean, median, and mode of the data below:

    x_1 = 12, x_2 = 8, x_3 = 7, x_4 = 15, x_5 = 10, x_6 = 13, x_7 = 12, x_8 = 13, x_9 = 9, x_{10} = 7, x_{11} = 14, x_{12} = 11, x_{13} = 15, x_{14} = 9, x_{15} = 9

  • Answer:

    • Mean: \frac{(12+8+7+15+10+13+12+13+9+7+14+11+15+9+9)}{15} = \frac{164}{15} = 10.93
    • Median: ordered data are 7, 7, 8, 9, 9, 9, 10, 11, 12, 12, 13, 13, 14, 15, 15; n = 15, so the median is the [(15+1)/2]^{th} = 8^{th} value = 11
    • Mode: 9

Measures of Variability (Numerical Variables)

  1. Variance

    • Considers the position of each observation relative to the mean of the set.

    • Measures the amount of spread or variability of observations from the mean.

    • Formula: variance s^2 = \frac{\sum (x - \bar{x})^2}{n-1}

    • Example:

      • An inventory of office equipment in 4 randomly selected departments showed that physics is in possession of 2 calculators, chemistry has 5, mathematics 7, and business administration 10. Find the variance.

      • Solution:

        • Find the mean, \bar{x} = \frac{(2 + 5 + 7 + 10)}{4} = 6

        • variance, s^2 = \frac{(2 - 6)^2 + (5 - 6)^2 + (7 - 6)^2 + (10 - 6)^2}{4 - 1} = \frac{34}{3} = 11.333

  2. Standard Deviation

    • Square root of variance.
    • Most widely used measure of variability; easier to interpret than the variance because it is in the same units as the data.
    • The smaller the value, the closer the observations lie to the mean.
    • Sensitive to extreme values.
    • Formula: standard deviation s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}

    • Example:

      • Jack Knife was randomly given the following scores by 6 judges on his first dive at the Hunting Hills annual swim meet: 4, 5, 6, 6, 7, and 8. Find the standard deviation.

      • Solution:

        • Find the mean, \bar{x} = \frac{(4+5+6+6+7+8)}{6} = 6
        • Sum of squared deviations, \sum (x - \bar{x})^2 = 4 + 1 + 0 + 0 + 1 + 4 = 10
        • Standard deviation, s = \sqrt{\frac{10}{6-1}} = \sqrt{2} = 1.41
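
The variance and standard deviation formulas can be written out step by step as below. `sample_variance` is a helper name introduced for this sketch; it is cross-checked against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import statistics


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


calculators = [2, 5, 7, 10]        # office equipment example
dive_scores = [4, 5, 6, 6, 7, 8]   # diving judges example

var_calculators = sample_variance(calculators)
sd_dive = sample_variance(dive_scores) ** 0.5   # standard deviation = sqrt(variance)

print(f"variance (calculators) = {var_calculators:.3f}")
print(f"std dev (dive scores)  = {sd_dive:.2f}")

# Cross-check against the standard library (both use the n - 1 denominator).
print(statistics.variance(calculators), statistics.stdev(dive_scores))
```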
    

Hands on

  • Data: 25, 23, 27, 29, 28, 25, 26.
  • Find the variance and standard deviation.
  • Answer:
    • Mean: \bar{x} = \frac{25+23+27+29+28+25+26}{7} = \frac{183}{7} = 26.14
    • Variance, s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{24.86}{7-1} = 4.14
    • Standard deviation, s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{4.14} = 2.04
  3. Range

    • Simplest and least useful measure of variability

    • Only for quick estimate of variability

    • The difference between the maximum and minimum value of the distribution

    • Tends to increase with sample size

    • Sensitive to very extreme values

    • Example: The ages of the 5 children in the cast of the new musical opening at the Star City Playhouse are 8, 12, 12, 15, and 17 years. Find the range.

      • Solution: The range of the 5 ages is max - min = 17 - 8 = 9 years
  4. Outlier

    • Also called extreme values
    • Values that are very small or very large relative to the majority of the values in a data set

Impact of Outliers

  • Outlier extremely high:
    • Mean increases, SD increases
  • Outlier extremely low:
    • Mean decreases, SD increases
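
A small experiment can make this pattern concrete. The base data are the seven values from the hands-on exercise above; the two outlier values (60 and 2) are hypothetical and added only for illustration.

```python
import statistics

base = [25, 23, 27, 29, 28, 25, 26]   # hands-on data from the notes
with_high = base + [60]               # hypothetical extremely high outlier
with_low = base + [2]                 # hypothetical extremely low outlier

for label, data in [("base", base), ("high outlier", with_high), ("low outlier", with_low)]:
    mean = statistics.mean(data)
    sd = statistics.stdev(data)       # sample SD (n - 1 denominator)
    print(f"{label:<13} mean = {mean:6.2f}  SD = {sd:6.2f}")
```

In both cases the standard deviation grows, because it measures distance from the mean regardless of direction; only the mean moves toward the outlier.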

Measures of Group Position (Numerical Variables)

  1. Quartiles

    • 3 summary measures that divide a ranked dataset into 4 equal parts
    • First quartile, Q1: value of the middle term among the observations that are less than the median
    • Second quartile, Q2: same as the median of a data set
    • Third quartile, Q3: value of the middle term among the observations that are greater than the median
  2. Interquartile Range (IQR)

    • The difference between the third and the first quartiles

    • IQR = Q3 – Q1

    • Example:

      • The following are the scores of 12 students in a mathematics class: 75, 80, 68, 53, 99, 58, 76, 73, 85, 88, 91, 79

        • a) Find the values of the 3 quartiles. b) Find the interquartile range (IQR).
      • Solution:
        • First, rank the given scores in increasing order: 53, 58, 68, 73, 75, 76, 79, 80, 85, 88, 91, 99
        • Then calculate the quartiles:

      • Q1 = \frac{(68+73)}{2} = 70.5

      • Q2 = \frac{(76+79)}{2} = 77.5

      • Q3 = \frac{(85+88)}{2} = 86.5

  • Interpretation:

    • 1/4 or 25% of the students have a score less than 70.5
    • 1/2 or 50% of the students have a score less than 77.5
    • 3/4 or 75% of the students have a score less than 86.5
    • 25% of the students have a score more than 86.5

  • Interquartile range (IQR) formula: IQR = Q3 - Q1

    • Q3 = 86.5

    • Q1 = 70.5

    • IQR = 86.5 - 70.5 = 16

    • Interpretation:

      • The middle 50% of the students have scores between 70.5 and 86.5
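
The quartile definition used in these notes (Q1 and Q3 as the medians of the lower and upper halves) can be sketched as follows. `quartiles` is a helper written for this example; statistical packages may use other interpolation rules and report slightly different quartile values for the same data.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from the notes."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # observations below the median
    upper = ordered[-half:]    # observations above the median
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)


scores = [75, 80, 68, 53, 99, 58, 76, 73, 85, 88, 91, 79]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

When n is odd, the slicing above leaves the middle observation out of both halves, which matches the wording "less than the median" and "greater than the median".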
  3. Standard score (Z-score)

    • Also called z-score

    • Position of a particular value x relative to the mean, measured in standard deviations

    • Used to help make a comparison of 2 raw scores that come from separate populations

    • Formula: z = \frac{x - \bar{x}}{s}

    • Example:
      You want to compare your Math score with your friend’s Math score from a different class. Your score was 45 points and your friend’s score was 72 points. The mean score in your class was 38 points, compared with 65 points in your friend’s class, and the standard deviation in your class was 7, compared with 14 in your friend’s class. Which score is better?

      • a (you): x = 45, \bar{x} = 38, s = 7
      • z score a = \frac{(45 - 38)}{7} = 1
      • b (your friend): x = 72, \bar{x} = 65, s = 14
      • z score b = \frac{(72 - 65)}{14} = 0.5
      • Conclusion: your score is one standard deviation above your class mean, while your friend’s score is only half a standard deviation above their class mean. Relative to each class, your score is the better of the two.
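
The same comparison can be sketched in a few lines of Python; `z_score` is a helper name introduced here for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_you = z_score(45, mean=38, sd=7)       # your class
z_friend = z_score(72, mean=65, sd=14)   # your friend's class

print(f"z (you)    = {z_you}")
print(f"z (friend) = {z_friend}")
print("Higher relative standing:", "you" if z_you > z_friend else "friend")
```

Both raw scores are 7 points above their class mean; the z-scores differ only because the spread of scores in the two classes differs.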

Standard Normal Distribution

  • Properties

    • Total area under the normal curve is equal to 1

    • Distribution is mounded and symmetric

    • Has a mean of 0 and standard deviation of 1

    • The mean divides the area in half, 0.50 on each side

    • Nearly all the area is between z=-3.00 and z=3.00

    • Example: Find the area under the standard normal curve between z = 0 and z = 1.52.

    • Solution: In the standard normal distribution table, z = 1.52 is located at the row labeled 1.5 and the column labeled 0.02. The value at their intersection, 0.4357, is the area (probability) for the interval from z = 0.00 to z = 1.52.
      Area or probability expressed as: P(0.00 < z < 1.52) = 0.4357
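
When no printed table is at hand, the same area can be obtained from the cumulative distribution function. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area between z = 0 and z = 1.52 is the difference of two cumulative areas.
area = standard_normal.cdf(1.52) - standard_normal.cdf(0.0)

print(f"P(0.00 < z < 1.52) = {area:.4f}")
```

Note that `cdf` returns the area to the left of z, whereas the table described above gives the area from 0 to z, hence the subtraction of `cdf(0.0)`, which is 0.5.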

Graphical Presentation

  • Graphs are visual presentations of a frequency distribution and may show:

    • Differences in spread (variability)
    • Differences in the shape of the distribution
  • Types of useful graphs:

    • Histogram
    • Polygon
    • Stem and leaf
    • Line graph
    • Box plot
    • Scatter plot (correlation technique)

Histogram

  • Each bar represents a class interval
  • A normal curve line may be overlaid on the bars
  • Bar height represents frequency or percent
  • Class intervals are adjacent, with no gaps in between

Stem and Leaf Plot

  • Another tool for visually displaying continuous data

  • Very similar to a histogram

  • Allows for easier identification of individual values in the sample

  • Each numerical value is divided into 2 parts:

    • The leading digit becomes the STEM
    • The trailing digit becomes the LEAF
  • Example:
    Let’s construct a stem-and-leaf display for the 19 exam scores below:

74 82 96 66 76 78 72 52 68 86 84 62 76 78 92 82 74 88 79

Solution
  • At a quick glance, we can see scores in the 50s, 60s, 70s, 80s, and 90s.
  • Display the stems in a vertical position.
  • Place the leaf of each score next to its stem, continuing until all 19 scores are placed.
  • The resulting stem-and-leaf display is shown below.
5 | 2
6 | 6 8 2
7 | 4 6 8 2 6 8 4 9
8 | 2 6 4 2 8
9 | 6 2

stem | leaf
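
The construction steps can be sketched in Python as below. `stem_and_leaf` is a helper written for this example; it assumes two-digit integer scores, so the tens digit is the stem and the units digit is the leaf, and it keeps leaves in the order the scores are listed.

```python
from collections import defaultdict

scores = [74, 82, 96, 66, 76, 78, 72, 52, 68, 86,
          84, 62, 76, 78, 92, 82, 74, 88, 79]


def stem_and_leaf(values):
    """Group each value's trailing digit (leaf) under its leading digit (stem)."""
    display = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)   # e.g. 74 -> stem 7, leaf 4
        display[stem].append(leaf)
    return dict(sorted(display.items()))


plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting each list of leaves before printing would give an ordered stem-and-leaf display, which makes the median and quartiles easier to read off.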

  • Hands on

175 180 168 153 199 158 176 173 185 188 191 179 155 175 166 175 178 190

Box Plot

  • A graphical display that uses descriptive statistics based on percentiles
  • Also called a “5-number summary plot”: min, Q1, Q2 (median), Q3, and max
  • Provides information about central tendency and the variability of the middle 50% of the distribution
    • The box represents the IQR, from the 25th to the 75th percentile
    • Outliers are observations more than 1.5 times the IQR away from the edges of the box (more than 3.0 times the IQR are extreme values)
    • The lines (whiskers) extend to the smallest and largest values that are not outliers
  • Makes it easy to compare continuous data across multiple groups, since box plots can be plotted side by side
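
The numbers behind a box plot can be sketched without any plotting library. The example reuses the 12 mathematics scores from the quartile example and the median-of-halves quartile rule; `five_number_summary` is a helper name introduced here, and plotting software may compute quartiles slightly differently.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]


scores = [75, 80, 68, 53, 99, 58, 76, 73, 85, 88, 91, 79]
minimum, q1, q2, q3, maximum = five_number_summary(scores)
iqr = q3 - q1

# Outlier fences: 1.5 x IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme observations that are not outliers.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"box: {q1} to {q3}, median {q2}")
print(f"fences: {lower_fence} to {upper_fence}, outliers: {outliers}")
print(f"whiskers: {whisker_low} to {whisker_high}")
```

For these scores no observation falls outside the fences, so the whiskers simply run from the minimum to the maximum.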

Scatter Plot

  • A plot of all the ordered pairs of bivariate data on a coordinate axis system
  • The input variable, x, is plotted on the horizontal axis
  • The output variable, y, is plotted on the vertical axis
  • Can be used as a basic graphical presentation in correlation analysis