Stat 111 Introduction to Statistics

Stat 111: Introduction to Statistics (3 Credits)

Course Content

  • Descriptive Statistical Data: Types, sources, and methods of data collection and presentation.
    • Tables, charts, and graphs.
    • Errors and approximations.
    • Frequency and cumulative frequency distributions.
    • Measures of location or central tendency, partition, dispersion, skewness, and kurtosis.
    • Rates, ratios, and index numbers.

Definition, Scope, and Limitations of Statistics

  • Statistics originated as the science of statehood but has since found applications in various fields:
    • Agriculture
    • Economics
    • Commerce
    • Biology
    • Medicine
    • Industry
    • Planning
    • Education
    • Other fields
  • Statistics is concerned with scientific methods for:
    • Collecting, organizing, summarizing, presenting, and analyzing data.
    • Deriving valid conclusions and making reasonable decisions based on analysis.
  • Statistics involves the systematic collection of numerical data and its interpretation.
  • The term 'statistics' has two meanings:
    • Numerical facts, such as the number of people living in a specific area.
    • The study of methods for gathering, analyzing, and interpreting data.

Assignment Topics

  • Discuss the applications of statistics in:
    • Industry
    • Commerce
    • Agriculture
    • Economics
    • Planning
    • Education
    • Medicine
  • Briefly discuss the advantages and disadvantages of primary and secondary data.

Types of Statistics

  1. Descriptive Statistics:
    • Primary concern is the description of large amounts of data.
    • Includes data classification and diagrammatic representation of data, such as:
      • Histograms
      • Bar charts
      • Pictograms
    • Computation of descriptive statistics such as:
      • Mean
      • Median
      • Mode
      • Range
      • Variance
      • Among others
  2. Inferential or Inductive Statistics:
    • Allows drawing conclusions about the population based on samples drawn from the population.

Nature of Statistical Data

  • Quantitative Variables:
    • Statistics deals with numerical data.
    • Business, economics, social, and scientific data can all be measured or counted directly.
    • Examples:
      • Daily sales data
      • Value of commodity's export or import
      • Number of completed buildings
      • Number of registered contractors
      • Wages
  • Qualitative Variables:
    • Information that is not numerical in nature, such as skin, eye, and hair color, level of education, sex, and marital status.
    • Qualitative variables cannot be numerically measured directly.
    • The process of assigning numerical values to qualitative data is known as coding.
    • Qualitative data can also be arranged in descending order of importance and assigned values in that order; this is known as ranking.

Sources of Data

  1. Primary Source:
    • The original information gathered by the investigator for conducting an investigation or study.
    • Unique in nature.
    • Derived from surveys conducted by individuals, research institutions, or organizations.
    • Accomplished through:
      • Personal investigation
      • Investigation teams
      • Questionnaires
  2. Secondary Data Source:
    • Pre-existing data collected by other people for another purpose.
    • Primarily the results of administrative activities.
    • Must be used with caution because:
      • May not provide the exact information requested.
      • May not be in the appropriate format.
      • May be insufficient or outdated.
      • May not cover the required area of interest.
    • Published statistics and historical records are secondary sources.
    • Users of statistical information are not the original data collectors.
    • Examples:
      • Daily newspapers
      • Magazines
      • Miscellaneous periodicals
      • Research reports
      • Manpower Board of Surveys
      • Central Bank of Nigeria Statistical Bulletin
      • Annual reports and statements of accounts

Methods of Data Collection

  • Primary data can be collected by the following methods:
    1. Direct observation
    2. Personal interview
    3. Questionnaire
    4. Telephone interview
    5. Survey
    6. Census

Types of Frequency Distribution

  1. Discrete (or Ungrouped) Frequency Distribution:

    • The frequencies refer to discrete values; each class is distinct and separate from the others.

    • There is no continuity from one class to the next.

    • Examples:

      • Number of rooms in a house
      • Number of companies registered in a country
      • Number of children in a family
    • Example Data:

      • Survey of 40 families in a village recording the number of children per family:
        2, 1, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5, 3, 3, 2, 4, 2, 3, 4, 5, 4, 1, 2, 4, 3, 2, 1, 2, 3, 3, 0, 2
    • Example Frequency Distribution Table:

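      | No. of children | Tally marks | Frequency |
      | :- | :- | -: |
      | 0 | // | 2 |
      | 1 | ///// // | 7 |
      | 2 | ///// ///// // | 12 |
      | 3 | ///// //// | 9 |
      | 4 | ///// | 5 |
      | 5 | /// | 3 |
      | 6 | // | 2 |
      | Total | | 40 |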
  2. Continuous (or Grouped) Frequency Distribution:

    • The data collected may be too large to manage easily, making it necessary to group them into intervals.
    • Data organized into intervals are called grouped data.
    • The advantage of grouping is that a very large array of data is reduced to a smaller, more manageable size.
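
The tallying step for a discrete frequency distribution can be sketched in a few lines of Python. This is only an illustration using the 40-family survey data above; the function name `discrete_frequency_table` is made up for this sketch.

```python
from collections import Counter

# Number of children per family for the 40 surveyed families
children = [2, 1, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5,
            3, 3, 2, 4, 2, 3, 4, 5, 4, 1, 2, 4, 3, 2, 1, 2, 3, 3, 0, 2]

def discrete_frequency_table(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)          # counts each distinct value
    return sorted(counts.items())

table = discrete_frequency_table(children)
for value, freq in table:
    print(f"{value:>2} children: {freq}")
print("Total:", sum(freq for _, freq in table))
```

Each distinct value becomes its own class, which is what separates this from the grouped case, where values are first assigned to intervals.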

Basic Technical Terms in Continuous Frequency Distribution

  1. Class Limits:
    • The lowest and highest values that can be included in the class.
    • Example: In the class 50 - 100, the lowest value is 50 and the highest value is 100.
    • The two boundaries of a class are known as the lower limit (denoted by l) and the upper limit (denoted by u) of the class in statistical calculations.
  2. Class Interval:
    • Defined as the size of each group of data.
    • Example: 50-75, 75-100, 100-125 are class intervals.
    • Each grouping begins with the lower limit of a class interval and ends at the lower limit of the next succeeding class interval.
  3. Width or Size of the Class Interval:
    • If a class interval is exclusive (continuous), its width or size is the difference between the upper and lower class limits and is denoted by C.
  4. Range:
    • The difference between the largest and smallest value of the observation, denoted by R.
    • R = L - S, where L is the largest value and S is the smallest value.
  5. Mid-Value or Mid-Point:
    • The central point of a class interval.
    • Calculated by adding the upper and lower limits of a class and dividing the sum by 2.
    • \text{Mid-value} = \frac{l + u}{2}, where l is the lower limit and u is the upper limit of the class.
    • Example: If the class interval is 20-30, then the mid-value is \frac{20 + 30}{2} = 25
  6. Number of Class Intervals:
    • The number of class intervals in a frequency distribution is a matter of importance.
    • The number of class intervals should not be too many, and an ideal frequency distribution can vary from 5 to 15 class intervals.
    • To decide the number of class intervals, choose the lowest and the highest values. The difference between them will enable us to decide the class intervals.
    • Can be decided with the help of Sturges' Rule.
    • According to Sturges' Rule: K = 1 + 3.322 \log N
      • Where:
        • N = total number of observations
        • \log = logarithm to base 10
        • K = number of class intervals
    • Example: If the number of observations is 10, then the number of class intervals is K = 1 + 3.322 \log 10 = 4.322
  7. Size of the Class Interval:
    • The size of the class interval is inversely proportional to the number of class intervals in a given distribution.
    • The approximate size (or width or magnitude) of the class interval C is obtained by using Sturges' rule.
    • C = \frac{\text{Range}}{\text{Number of class intervals}} = \frac{R}{K}
    • C = \frac{R}{1 + 3.322 \log N}
      • Where R = range
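
Sturges' rule and the class-width formula are easy to check numerically. The sketch below assumes a base-10 logarithm (the constant 3.322 is approximately 1 / \log_{10} 2); the function names are illustrative only.

```python
import math

def sturges_classes(n):
    """Number of class intervals K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, n):
    """Approximate class width C = R / K, with R = largest - smallest."""
    value_range = largest - smallest
    return value_range / sturges_classes(n)

# Example from the notes: 10 observations
k = sturges_classes(10)
print(round(k, 3))
```

In practice K and C are rounded to convenient whole numbers before the classes are written down.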

Types of Class Intervals

  • There are three methods of classifying data according to class intervals:

    • Exclusive (continuous) method
    • Inclusive (Discrete) method
    • Open-end classes
  1. Exclusive (Continuous) Method:

    • In this type of class interval, the upper limit of one class is the same as the lower limit of the next class, so the class limits overlap.

    • Example Expenditure and Number of Families:

      | Expenditure | Number of families |
      | :- | -: |
      | 0-5000 | 60 |
      | 5000-10000 | 95 |
      | 10000-15000 | 122 |
      | 15000-20000 | 83 |
      | 20000-25000 | 40 |
      | Total | 400 |
    • The first class interval implies all data from 0 to 4999.99; 5000 is not included in the first class but in the second class, and so on.

  2. Inclusive (Discrete) Method:

    • In this method, the overlapping of the class intervals is avoided.

    • Both the lower and upper limits are included in the class interval.

    • Used for grouped frequency distributions of discrete variables, such as the number of members in a family or the number of workers in a factory, where the variable takes only integral values. It is not suitable for continuous variables such as age, height, or weight, which can take fractional values.

    • Example Distribution:

      | Class Interval (C.I) | Frequency |
      | :- | -: |
      | 5-9 | 5 |
      | 10-14 | 7 |
      | 15-19 | 12 |
      | 20-24 | 21 |
      | 25-29 | 10 |
      | 30-34 | 5 |
      | Total | 60 |
    • To decide whether to use the inclusive or exclusive method, it is important to determine whether the variable under observation is continuous or discrete.

    • In the case of continuous variables, the exclusive method must be used, and the inclusive method should be used in the case of discrete variables.

  3. Open-End Classes:

    • A class limit is missing at the lower end of the first class interval, at the upper end of the last class interval, or both, so the limits of those classes are not fully specified.

    • The necessity of open-end classes arises in practical situations, particularly relating to economics and medical data when there are few very high values or few very low values which are far apart from the majority of observations.

    • Example:

      | Salary Range | Number of workers |
      | :- | -: |
      | Below 2000 | 7 |
      | 2000-4000 | 5 |
      | 4000-6000 | 6 |
      | 6000-8000 | 4 |
      | 8000 and above | 3 |
      | Total | 25 |

Preparation of Frequency Table

  • Example: Given the numbers of tools produced by workers in a factory:

    38, 25, 13, 14, 27, 41, 47, 17, 32, 25, 43, 18, 25, 18, 39, 44, 19, 20, 20, 26, 40, 45, 34, 31, 32, 27, 33, 37, 25, 26, 33, 28, 31, 34, 35, 46, 29, 34, 31, 34, 35, 24, 30, 41, 32, 29, 28, 30, 31, 30, 34, 35, 36, 29, 26, 32, 36, 35, 36, 37, 23, 32, 23, 22, 29, 33, 37, 33, 27, 24, 36, 42, 29, 37, 29, 23, 44, 41, 45, 39, 21, 42, 22, 28, 22, 15, 16, 17, 21, 22, 29, 35, 31, 27, 40, 23, 32, 40, 37

  • Use Sturges' rule to determine the number of class intervals and prepare a frequency distribution table.

    • Solution:

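      • Number of class intervals: the data listed above contain N = 99 values, so K = 1 + 3.322 \log N = 1 + 3.322 \log 99 \approx 7.6

      • Class Interval Size: the largest value is 47 and the smallest is 13, so C = \frac{R}{K} = \frac{47 - 13}{7.6} = 4.47 \approx 5

      • Taking C = 5, we have the classes 13-17, 18-22, ..., 43-47 as inclusive types.

      • Frequency Table:

        | C.I | Tally | Frequency |
        | :- | :- | -: |
        | 13-17 | ///// / | 6 |
        | 18-22 | ///// ///// / | 11 |
        | 23-27 | ///// ///// ///// // | 17 |
        | 28-32 | ///// ///// ///// ///// //// | 24 |
        | 33-37 | ///// ///// ///// ///// /// | 23 |
        | 38-42 | ///// ///// / | 11 |
        | 43-47 | ///// // | 7 |
        | Total | | 99 |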

Histogram

  • A histogram is also called a block frequency diagram.

  • Frequency distribution can be represented in the form of graphs and charts.

  • A histogram represents a continuous distribution. If the class intervals are discrete (inclusive), they must first be converted to continuous class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.

  • The histogram is constructed by plotting frequency against class boundaries.

  • Example: The scores of thirty students in a statistics examination were given as follows:

    126, 145, 137, 145, 140, 146, 131, 143, 127, 133, 134, 144, 136, 135, 128, 130, 137, 142, 141, 139, 147, 149, 150, 148, 146, 150, 148, 151, 153, 155

    Use the above information to obtain the histogram of the distribution.

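    | Class interval (C.I) | Class boundaries (C.B) | Upper class boundary (U.C.B) | Frequency (F) |
    | :- | :- | -: | -: |
    | 121-125 | 120.5-125.5 | 125.5 | 0 |
    | 126-130 | 125.5-130.5 | 130.5 | 4 |
    | 131-135 | 130.5-135.5 | 135.5 | 4 |
    | 136-140 | 135.5-140.5 | 140.5 | 5 |
    | 141-145 | 140.5-145.5 | 145.5 | 6 |
    | 146-150 | 145.5-150.5 | 150.5 | 8 |
    | 151-155 | 150.5-155.5 | 155.5 | 3 |
    | 156-160 | 155.5-160.5 | 160.5 | 0 |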
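
The conversion from inclusive class limits to class boundaries, and the counting of frequencies, can be sketched as follows. This is a minimal illustration with the thirty examination scores; the helper names and the `gap` parameter (the distance between one upper limit and the next lower limit, here 1) are assumptions of the sketch.

```python
scores = [126, 145, 137, 145, 140, 146, 131, 143, 127, 133,
          134, 144, 136, 135, 128, 130, 137, 142, 141, 139,
          147, 149, 150, 148, 146, 150, 148, 151, 153, 155]

# Inclusive class limits of width 5
classes = [(126, 130), (131, 135), (136, 140), (141, 145), (146, 150), (151, 155)]

def class_boundaries(lower, upper, gap=1):
    """Subtract half the gap from the lower limit and add it to the upper limit."""
    half = gap / 2
    return lower - half, upper + half

def frequency(values, lower, upper):
    """Count the values falling inside the inclusive class lower..upper."""
    return sum(lower <= v <= upper for v in values)

# One row per class: lower limit, upper limit, lower boundary, upper boundary, frequency
rows = [(lo, hi, *class_boundaries(lo, hi), frequency(scores, lo, hi))
        for lo, hi in classes]
for row in rows:
    print(row)
```

The bars of the histogram are then drawn between the lower and upper boundary of each row, with height equal to the frequency.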

Frequency Polygon

  • It is obtained by plotting the frequency of each class against the midpoint of its class interval.

  • It can also be obtained by joining the mid-points of the tops of the rectangles of the histogram and extending the line to meet the x-axis.

  • The polygon will have the same area as the corresponding histogram if the class intervals are of equal width.

  • Using the examination scores from the histogram example, plot the frequency polygon of the distribution.

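    | Class interval (C.I) | Class boundaries (C.B) | Upper class boundary (U.C.B) | Frequency (F) | Mid-value |
    | :- | :- | -: | -: | -: |
    | 121-125 | 120.5-125.5 | 125.5 | 0 | 123.00 |
    | 126-130 | 125.5-130.5 | 130.5 | 4 | 128.00 |
    | 131-135 | 130.5-135.5 | 135.5 | 4 | 133.00 |
    | 136-140 | 135.5-140.5 | 140.5 | 5 | 138.00 |
    | 141-145 | 140.5-145.5 | 145.5 | 6 | 143.00 |
    | 146-150 | 145.5-150.5 | 150.5 | 8 | 148.00 |
    | 151-155 | 150.5-155.5 | 155.5 | 3 | 153.00 |
    | 156-160 | 155.5-160.5 | 160.5 | 0 | 158.00 |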
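
The points of a frequency polygon can be generated from the class limits and frequencies. The sketch below assumes equal class widths and closes the polygon at the mid-values of the empty classes one width beyond each end; the variable and function names are illustrative.

```python
# Inclusive class limits and frequencies from the examination-score example
classes = [(126, 130), (131, 135), (136, 140), (141, 145), (146, 150), (151, 155)]
frequencies = [4, 4, 5, 6, 8, 3]

def mid_value(lower, upper):
    """Mid-value of a class: (lower limit + upper limit) / 2."""
    return (lower + upper) / 2

midpoints = [mid_value(lo, hi) for lo, hi in classes]

# Assumption: equal class widths, so the distance between neighbouring
# mid-values equals the class width.
width = midpoints[1] - midpoints[0]

# Add a zero-frequency point one class width before and after the data
polygon_points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, frequencies))
    + [(midpoints[-1] + width, 0)]
)
print(polygon_points)
```

Joining these points in order with straight lines gives the polygon.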

Cumulative Frequency Distribution (Ogive) or (C.F. Curve)

  • An ogive is obtained by plotting cumulative frequency against the upper class boundary. It can be used to estimate the median, quartiles, deciles, percentiles, and interquartile range.

  • The graph is usually an S shape.

  • Using the examination scores from the histogram example, plot the cumulative frequency curve (ogive) of the distribution.

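    | Class interval (C.I) | Class boundaries (C.B) | Upper class boundary (U.C.B) | Frequency (F) | Cumulative frequency (C.F) |
    | :- | :- | -: | -: | -: |
    | 121-125 | 120.5-125.5 | 125.5 | 0 | 0 |
    | 126-130 | 125.5-130.5 | 130.5 | 4 | 4 |
    | 131-135 | 130.5-135.5 | 135.5 | 4 | 8 |
    | 136-140 | 135.5-140.5 | 140.5 | 5 | 13 |
    | 141-145 | 140.5-145.5 | 145.5 | 6 | 19 |
    | 146-150 | 145.5-150.5 | 150.5 | 8 | 27 |
    | 151-155 | 150.5-155.5 | 155.5 | 3 | 30 |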
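
The cumulative frequencies are running totals of the class frequencies, which the following sketch computes with `itertools.accumulate` from the Python standard library. The starting point at the lower boundary of the first class, with cumulative frequency 0, is included so the curve begins on the x-axis.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from the examination-score example
upper_boundaries = [130.5, 135.5, 140.5, 145.5, 150.5, 155.5]
frequencies = [4, 4, 5, 6, 8, 3]

# Running total of the frequencies
cumulative = list(accumulate(frequencies))

# Points to plot: (upper class boundary, cumulative frequency)
ogive_points = [(125.5, 0)] + list(zip(upper_boundaries, cumulative))
for boundary, cf in ogive_points:
    print(boundary, cf)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the arithmetic.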

Errors and Approximations

  • A statistical error is the difference between the actual magnitude of the quantity in question and the estimate of that magnitude given by the enumerator or researcher.

  • For example, an investigator estimated that 4,832 people use a particular toothpaste in an area, but the actual number of people that use the toothpaste is 5,241.

    • Then statistical error = actual number of people - estimated number of people:

      5241 - 4832 = 409

  • A statistical error is not the same as a mistake made in the process of carrying out the calculations.

Causes of Statistical Error

  • Errors due to the measuring instruments used; some instruments used may not have the capacity to give accurate measurements.
  • Errors due to the inability of the researcher to correctly use the instrument or improper units of measurements.
  • Errors due to wrong information supplied by the respondents themselves.
  • Errors resulting from small samples that may not be truly representative of the population concerned.
  • Errors due to unnecessary approximations of the measurements of objects.

Measurement of Statistical Errors

  1. Absolute Error:

    • This is given by subtracting the estimated value from the actual value.

      \text{Absolute error} = \text{Actual value} - \text{Estimated value}

    • Example: The estimated number of employees who will resign after 10 years of service is 15. The actual number who resign is 17.

      A.E = 17 - 15 = 2

  2. Relative Error (or percentage error):

    • This is the absolute error divided by the estimated value. When this proportion is multiplied by 100, it becomes the percentage error.

      R.E = \frac{\text{Absolute error}}{\text{Estimated value}}

      P.E = \frac{A.E}{E.V} \times 100\%
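
The three error measures can be computed directly from their definitions. The sketch below follows the definitions in these notes, dividing by the estimated value; note that some texts divide by the actual value instead. The function names are illustrative.

```python
def absolute_error(actual, estimated):
    """A.E = actual value - estimated value."""
    return actual - estimated

def relative_error(actual, estimated):
    """R.E = absolute error / estimated value (definition used in these notes)."""
    return absolute_error(actual, estimated) / estimated

def percentage_error(actual, estimated):
    """P.E = relative error x 100."""
    return relative_error(actual, estimated) * 100

# Resignation example: estimated 15, actual 17
print(absolute_error(17, 15))                # 2
print(round(relative_error(17, 15), 4))      # 0.1333
print(round(percentage_error(17, 15), 2))    # 13.33
```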

Measures of Central Tendency

  • Measures of central tendency or measures of location, simply called averages, are widely used statistical measures.

  • A measure of central tendency locates a central value around which the other values in the distribution tend to cluster.

  • The measure is important because such a value, once determined, can be taken as representative of the group.

  • The five measures of central tendency are:

    1. Arithmetic mean or simple mean.
    2. Median
    3. Mode
    4. Geometric mean
    5. Harmonic mean

Arithmetic Mean

  • Defined as the sum of the observations divided by the number of observations.

  • If the variable X assumes values X_1, X_2, X_3, \ldots, X_n, then the mean \bar{X} is given by:

    \bar{X} = \frac{X_1 + X_2 + X_3 + \ldots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}

    • This formula is for the ungrouped or raw data.

    • Example: Calculate the mean for 2, 4, 6, 8, 10

      \bar{X} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6

Grouped Data

  • The mean for grouped data is obtained from the following formula:

    \bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}

    • Where:

      • x_i = midpoint of the individual class

      • f_i = the frequency of the individual class

      • \sum f_i = the sum of the frequencies or total frequency

  • Example: Given the following frequency distribution, calculate the arithmetic mean:

    | Mark | No. of students |
    | :- | -: |
    | 64 | 8 |
    | 63 | 18 |
    | 62 | 12 |
    | 61 | 9 |
    | 60 | 7 |
    | 59 | 6 |

    \bar{X} = \frac{\sum fx}{N} = \frac{3713}{60} = 61.88

  • Example: Calculate the arithmetic mean of the marks from the following table:

    | Marks | Number of Students |
    | :- | -: |
    | 0-10 | 12 |
    | 10-20 | 18 |
    | 20-30 | 27 |
    | 30-40 | 20 |
    | 40-50 | 17 |

    \bar{X} = \frac{\sum fx}{N} = \frac{2470}{94} = 26.28
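
Both worked examples above follow the same weighted-mean calculation, which can be sketched as below. For the class-interval table, each class is represented by its midpoint; the function name `grouped_mean` is illustrative.

```python
def grouped_mean(xs, fs):
    """Mean = sum(f * x) / sum(f), where x is a value or class midpoint."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

# First example: individual marks with frequencies
marks = [64, 63, 62, 61, 60, 59]
students = [8, 18, 12, 9, 7, 6]
mean_marks = grouped_mean(marks, students)      # 3713 / 60

# Second example: class intervals, represented by their midpoints
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [12, 18, 27, 20, 17]
midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean_classes = grouped_mean(midpoints, freqs)   # 2470 / 94

print(round(mean_marks, 2), round(mean_classes, 2))
```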

Median

  • The median is that value of the variable that divides the group into two equal parts, one part comprising all values greater, and the other, all values less than the median.

Ungrouped or Raw Data

  • Arrange the given values in increasing or decreasing order. If the number of values is odd, the median is the middle value. If the number of values is even, the median is the average (mean) of the two middle values.

    • Example: Find the median of 2, 1, 4, 3, 6, 5

      • Solution: 1, 2, 3, 4, 5, 6

        Median = \frac{3 + 4}{2} = 3.5

Median of Discrete Frequency Distribution

  • In the case of a discrete frequency distribution, the median is obtained by considering the Cumulative frequencies. The steps for calculating the median are given below:

    1. Find \frac{N}{2}, where N = \sum f

    2. Find the cumulative frequency (C.F) just greater than \frac{N}{2}

    3. The corresponding value of x is the median.

  • Example: Obtain the median for the following frequency distribution:

    | x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
    | :- | - | - | - | -- | -- | -- | -- | -- | -- | -- | -- |
    | f | 3 | 4 | 8 | 10 | 16 | 20 | 25 | 15 | 9 | 6 | 4 |

    • Solution: N = 120, \frac{N}{2} = 60

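      • The cumulative frequencies are 3, 7, 15, 25, 41, 61, 86, 101, 110, 116, 120. The cumulative frequency just greater than 60 is 61, which corresponds to x = 6. Therefore, the median is 6.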
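
The three steps for the median of a discrete frequency distribution can be sketched as follows, using the x and f values from the table above. The function name is illustrative, and the sketch follows the rule as stated in these notes (the first cumulative frequency strictly greater than N/2).

```python
from itertools import accumulate

def discrete_median(xs, fs):
    """Return the x whose cumulative frequency is the first to exceed N/2."""
    half = sum(fs) / 2                      # step 1: N/2
    for x, cf in zip(xs, accumulate(fs)):   # step 2: scan cumulative frequencies
        if cf > half:
            return x                        # step 3: corresponding x is the median
    return None                             # only reached for empty input

xs = list(range(1, 12))
fs = [3, 4, 8, 10, 16, 20, 25, 15, 9, 6, 4]

# Cumulative frequencies: 3, 7, 15, 25, 41, 61, 86, 101, 110, 116, 120
print(list(accumulate(fs)))
print(discrete_median(xs, fs))
```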

Median for Grouped Data

  • For grouped data, the median is defined as:

    Median = L_m + \left(\frac{\frac{n}{2} - cf_{bm}}{f_m}\right) \times c

    • Where,

      • L_m = lower class boundary of the median class

      • cf_{bm} = Cumulative frequency before the median class

      • f_m = Frequency of the median class

      • c = class size or width

      • \frac{n}{2} = The median position (to help identify the median class)

  • Example: The errors discovered in the lengths (in millimeters) of rods produced in a factory are given below:

    | Errors in length (mm) | 19-21 | 22-24 | 25-27 | 28-30 | 31-33 | 34-36 | 37-39 |
    | :- | - | - | - | - | - | - | - |
    | Number of rods | 9 | 12 | 18 | 23 | 19 | 13 | 6 |

    Estimate the median error in the length of rods.

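    Here n = 100, so \frac{n}{2} = 50. The cumulative frequencies are 9, 21, 39, 62, ..., so the median class is 28-30, with L_m = 27.5, cf_{bm} = 39, f_m = 23, and c = 3.

    Median = L_m + \left(\frac{\frac{n}{2} - cf_{bm}}{f_m}\right) \times c = 27.5 + \left(\frac{50 - 39}{23}\right) \times 3 = 27.5 + 1.43 = 28.93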
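
The grouped-median formula can be sketched in code as below, using the rod-length data. The function name and the `gap` parameter (the distance between the upper limit of one inclusive class and the lower limit of the next, here 1) are assumptions of this sketch.

```python
def grouped_median(classes, freqs, gap=1):
    """Median = L_m + ((n/2 - cf_bm) / f_m) * c for inclusive class limits."""
    half = sum(freqs) / 2                     # median position n/2
    cf_before = 0                             # cumulative frequency before the class
    for (lower, upper), f in zip(classes, freqs):
        if cf_before + f >= half:             # first class reaching n/2
            lower_boundary = lower - gap / 2  # L_m
            width = (upper + gap / 2) - lower_boundary  # c
            return lower_boundary + (half - cf_before) / f * width
        cf_before += f
    return None                               # only reached for empty input

classes = [(19, 21), (22, 24), (25, 27), (28, 30), (31, 33), (34, 36), (37, 39)]
freqs = [9, 12, 18, 23, 19, 13, 6]

print(round(grouped_median(classes, freqs), 2))
```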

Mode

  • For ungrouped data, the mode is the value that occurs most frequently.

  • Although the mode is easy to understand, it may not be unique or clearly defined, as some distributions have more than one mode.

  • A distribution with one mode is called a unimodal distribution; a distribution with two modes is called a bimodal distribution; and a distribution with more than two modes is referred to as a multi-modal distribution.

  • Examples:
    Find the mode of the following distribution:

    14, 19, 16, 21, 18, 24, 15 and 19

    • The mode is 19 (unimodal distribution)
  • Mode from Frequency Data: In frequency data, the mode is the number with the highest frequency.

  • Example:

    Find the mode of the distribution x: 1, 2, 3, 4, 5, 6, 7, 8, 9 and frequency f: 4, 9, 16, 25, 12, 15, 7, 3, 1

    • The value of x corresponding to the maximum frequency of 25 is 4, so the mode is 4.
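
Finding the mode amounts to looking for the largest frequency. The sketch below returns a list so that bimodal and multi-modal cases are reported in full; the function names are illustrative.

```python
from collections import Counter

def modes(values):
    """All values that occur most often in raw (ungrouped) data."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def modes_from_frequencies(xs, fs):
    """All x values whose frequency equals the maximum frequency."""
    top = max(fs)
    return [x for x, f in zip(xs, fs) if f == top]

# Raw data example
print(modes([14, 19, 16, 21, 18, 24, 15, 19]))

# Frequency data example
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
fs = [4, 9, 16, 25, 12, 15, 7, 3, 1]
print(modes_from_frequencies(xs, fs))
```

A result with two entries indicates a bimodal distribution, and more than two a multi-modal one.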

Mode for Grouped data

  • When data are grouped, the mode can be obtained using the following formula:

    \text{Mode} = L_m + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c

    • Where:

      • L_m = lower class boundary of the modal class

      • \Delta_1 = the difference between modal class frequency and the frequency of the next upper class