Stat 111 Introduction to Statistics

Stat 111: Introduction to Statistics (3 Credits)

Course Content

Descriptive Statistical Data: Types, sources, and methods of data collection and presentation.
- Tables, charts, and graphs.
- Errors and approximations.
- Frequency and cumulative frequency distributions.
- Measures of location or central tendency, partition, dispersion, skewness, and kurtosis.
- Rates, ratios, and index numbers.

Definition, Scope, and Limitations of Statistics

Statistics originated as the science of statehood but has since found applications in various fields:
- Agriculture
- Economics
- Commerce
- Biology
- Medicine
- Industry
- Planning
- Education
- Other fields
Statistics is concerned with scientific methods for:
- Collecting, organizing, summarizing, presenting, and analyzing data.
- Deriving valid conclusions and making reasonable decisions based on analysis.
Statistics involves the systematic collection of numerical data and its interpretation.
The term 'statistics' refers to:
- Numerical facts, such as the number of people living in a specific area.
- The investigation of methods for gathering, analyzing, and interpreting data.

Assignment Topics

Discuss the applications of statistics in:
- Industry
- Commerce
- Agriculture
- Economics
- Planning
- Education
- Medicine
Briefly discuss the advantages and disadvantages of primary and secondary data.

Types of Statistics

Descriptive Statistics:
- Primary concern is the description of large amounts of data.
- Includes data classification and diagrammatic representation of data, such as:
  - Histograms
  - Bar charts
  - Pictograms
- Computation of descriptive statistics such as:
  - Mean
  - Median
  - Mode
  - Range
  - Variance
  - Among others
Inferential or Inductive Statistics:
- Allows drawing conclusions about the population based on samples drawn from the population.

Nature of Statistical Data

Quantitative Variables:
- Statistics deals with numerical data.
- Business, economics, social, and scientific data can all be measured or counted directly.
- Examples:
  - Daily sales data
  - Value of commodity's export or import
  - Number of completed buildings
  - Number of registered contractors
  - Wages
Qualitative Variables:
- Information that is not numerical in nature, such as skin, eye, and hair color, level of education, sex, and marital status.
- Qualitative variables cannot be numerically measured directly.
- The process of assigning numerical values to qualitative data is known as coding.
- Qualitative data can be arranged in descending order of importance and assigned values in that order; this is known as ranking.

Sources of Data

Primary Source:
- The original information gathered by the investigator for conducting an investigation or study.
- Unique in nature.
- Derived from surveys conducted by individuals, research institutions, or organizations.
- Accomplished through:
  - Personal investigation
  - Investigation teams
  - Questionnaires
Secondary Data Source:
- Pre-existing data collected by other people for another purpose.
- Primarily the results of administrative activities.
- Must be used with caution because:
  - May not provide the exact information requested.
  - May not be in the appropriate format.
  - May be insufficient or outdated.
  - May not cover the required area of interest.
- Published statistics and historical records are secondary sources.
- Users of statistical information are not the original data collectors.
- Examples:
  - Daily newspapers
  - Magazines
  - Miscellaneous periodicals
  - Research reports
  - Manpower Board of Surveys
  - Central Bank of Nigeria Statistical Bulletin
  - Annual reports and statements of accounts

Methods of Data Collection

Primary data can be collected by the following methods:
1. Direct observation
2. Personal interview
3. Questionnaire
4. Telephone interview
5. Survey
6. Census

Types of Frequency Distribution

Discrete (or Ungrouped) Frequency Distribution:
- The frequency refers to discrete values; each class is distinct and separate from the other classes.
- There is no continuity from one class to the next.
- Examples:
  - Number of rooms in a house
  - Number of companies registered in a country
  - Number of children in a family
- Example Data:
  - Survey of 40 families in a village recording the number of children per family:
    2, 1, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5, 3, 3, 2, 4, 2, 3, 4, 5, 4, 1, 2, 4, 3, 2, 1, 2, 3, 3, 0, 2
- Example Frequency Distribution Table (a group of five tally marks is written as ||||/):
  No. of children    Tally marks       Frequency
  0                  ||                2
  1                  ||||/ ||          7
  2                  ||||/ ||||/ ||    12
  3                  ||||/ ||||        9
  4                  ||||/             5
  5                  |||               3
  6                  ||                2
  Total                                40
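The tallying step can be checked with a short Python sketch. It is only an illustration: the variable names `children` and `frequency` are assumptions introduced here, and the data are the 40 survey values listed in the example.

```python
from collections import Counter

# Number of children per family for the 40 surveyed families (from the example data).
children = [2, 1, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5,
            3, 3, 2, 4, 2, 3, 4, 5, 4, 1, 2, 4, 3, 2, 1, 2, 3, 3, 0, 2]

# Counter records how many times each distinct value occurs.
frequency = Counter(children)

# Print one row per distinct value, then the total number of observations.
for value in sorted(frequency):
    print(value, frequency[value])
print("Total", sum(frequency.values()))
```

Each printed row corresponds to one row of an ungrouped frequency table; the total should equal the number of families surveyed.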
Continuous (or Grouped) Frequency Distribution:
- When the data collected are too numerous to be managed easily, it becomes necessary to group them into intervals.
- Data organized into intervals are called grouped data.
- The advantage of grouping is that a frequency distribution reduces a very large array of data to a smaller, manageable size.

Basic Technical Terms in Continuous Frequency Distribution

Class Limits:
- The lowest and highest values that can be included in the class.
- Example: In the class 50 - 100, the lowest value is 50 and the highest value is 100.
- The two boundaries of a class are known as the lower limit (denoted by l) and the upper limit (denoted by u) of the class in statistical calculations.
Class Interval:
- Defined as the size of each group of data.
- Example: 50-75, 75-100, 100-125 are class intervals.
- Each grouping begins with the lower limit of a class interval and ends at the lower limit of the next succeeding class interval.
Width or Size of the Class Interval:
- If a class interval is exclusive (continuous), its width or size is the difference between the upper and lower class limits and is denoted by C.
Range:
- The difference between the largest and smallest values of the observations, denoted by R.
- R = L - S, where L is the largest value and S is the smallest value.
Mid-Value or Mid-Point:
- The central point of a class interval.
- Calculated by adding the upper and lower limits of a class and dividing the sum by 2.
- Mid \ value = \frac{l + u}{2}, where l and u are the lower and upper limits of the class
- Example: If the class interval is 20-30, then the mid-value is \frac{20 + 30}{2} = 25
Number of Class Intervals:
- The number of class intervals in a frequency distribution is a matter of importance.
- The number of class intervals should not be too large; an ideal frequency distribution has between 5 and 15 class intervals.
- To decide the number of class intervals, identify the lowest and the highest values. The difference between them helps determine the class intervals.
- The number can be decided with the help of Sturges' Rule.
- According to Sturges' Rule: K = 1 + 3.322 \log N
  - Where:
    - N = total number of observations
    - \log = logarithm to base 10
    - K = number of class intervals
- Example: If the number of observations is 10, then the number of class intervals is K = 1 + 3.322 \log 10 = 4.322 \approx 4
Size of the Class Interval:
- The size of the class interval is inversely proportional to the number of class intervals in a given distribution.
- The approximate value of the size (or width or magnitude) of the class interval C is obtained by using Sturges' Rule.
- Size \ of \ class \ interval = C = \frac{Range}{Number \ of \ class \ intervals} = \frac{R}{K}
- C = \frac{R}{1 + 3.322 \log N}
  - Where R = range
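Sturges' Rule and the class-width formula can be sketched in Python as follows. The function names `sturges_classes` and `class_width` are hypothetical, chosen only for this illustration, and the logarithm is taken to base 10.

```python
import math

def sturges_classes(n):
    """Number of class intervals K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, n):
    """Approximate class width C = R / K, where R = largest - smallest."""
    return (largest - smallest) / sturges_classes(n)

print(round(sturges_classes(10), 3))   # 4.322
print(round(sturges_classes(100), 3))  # 7.644
```

In practice K and C are rounded to convenient whole numbers before the classes are written down.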

Types of Class Intervals

There are three methods of classifying data according to class intervals:
- Exclusive (continuous) method
- Inclusive (Discrete) method
- Open-end classes

Exclusive (Continuous) Method:
- Type of class interval in which the class limits overlap: the upper limit of one class is the lower limit of the next class.
- Example: Expenditure and Number of Families
  | Expenditure | Number of families |
  | --- | --- |
  | 0-5000 | 60 |
  | 5000-10000 | 95 |
  | 10000-15000 | 122 |
  | 15000-20000 | 83 |
  | 20000-25000 | 40 |
  | Total | 400 |
- The first class interval implies all data from 0 to 4999.99; 5000 is not included in the first class but in the second class, and so on.
Inclusive (Discrete) Method:
- In this method, the overlapping of the class intervals is avoided.
- Both the lower and upper limits are included in the class interval.
- Used for grouped frequency distributions of discrete variables, such as the number of members in a family or the number of workers in a factory, where the variable takes only integral values. It is not suitable for variables that take fractional values, such as age, height, or weight.
- Example Distribution:
  | Class Interval (C.I) | Frequency |
  | --- | --- |
  | 5-9 | 5 |
  | 10-14 | 7 |
  | 15-19 | 12 |
  | 20-24 | 21 |
  | 25-29 | 10 |
  | 30-34 | 5 |
  | Total | 70 |
- To decide whether to use the inclusive or exclusive method, it is important to determine whether the variable under observation is continuous or discrete.
- For continuous variables, the exclusive method must be used; for discrete variables, the inclusive method should be used.
Open-End Classes:
- A class limit is missing at the lower end of the first class interval, at the upper end of the last class interval, or both, so those limits are not specified.
- Open-end classes are needed in practical situations, particularly with economic and medical data, when a few very high or very low values lie far from the majority of observations.
- Example:
  | Salary Range | Number of workers |
  | --- | --- |
  | Below 2000 | 7 |
  | 2000-4000 | 5 |
  | 4000-6000 | 6 |
  | 6000-8000 | 4 |
  | 8000 and above | 3 |
  | Total | 25 |

Preparation of Frequency Table

Example: Given the numbers of tools produced by workers in a factory:
38, 25, 13, 14, 27, 41, 47, 17, 32, 25, 43, 18, 25, 18, 39, 44, 19, 20, 20, 26, 40, 45, 34, 31, 32, 27, 33, 37, 25, 26, 33, 28, 31, 34, 35, 46, 29, 34, 31, 34, 35, 24, 30, 41, 32, 29, 28, 30, 31, 30, 34, 35, 36, 29, 26, 32, 36, 35, 36, 37, 23, 32, 23, 22, 29, 33, 37, 33, 27, 24, 36, 42, 29, 37, 29, 23, 44, 41, 45, 39, 21, 42, 22, 28, 22, 15, 16, 17, 21, 22, 29, 35, 31, 27, 40, 23, 32, 40, 37
Use Sturges' rule to determine the number of class intervals and prepare a frequency distribution table.
- Solution:
  - Number of class intervals: K = 1 + 3.322 \log N = 1 + 3.322 \log 100 = 7.6
  - Class Interval Size: the largest value is 47 and the smallest is 13, so C = \frac{R}{K} = \frac{47 - 13}{7.6} = 4.47 \approx 5
  - Taking C = 5, we have the classes 13-17, 18-22, ..., 43-47 as inclusive types.
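  - Frequency Table (a group of five tally marks is written as ||||/):
    C.I      Tally                            Frequency
    13-17    ||||/ |                          6
    18-22    ||||/ ||||/ ||                   12
    23-27    ||||/ ||||/ ||||/ ||             17
    28-32    ||||/ ||||/ ||||/ ||||/ ||||     24
    33-37    ||||/ ||||/ ||||/ ||||/ |||      23
    38-42    ||||/ ||||/ |                    11
    43-47    ||||/ ||                         7
    Total                                     100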
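The step of writing out inclusive classes from the smallest value, the largest value, and the chosen width can be sketched as below. This is a minimal illustration that assumes integer data; the function name `inclusive_classes` is hypothetical.

```python
def inclusive_classes(smallest, largest, width):
    """Return (lower, upper) limits of inclusive classes covering smallest..largest.

    Assumes integer data, so consecutive classes are separated by a gap of 1.
    """
    classes = []
    lower = smallest
    while lower <= largest:
        classes.append((lower, lower + width - 1))
        lower += width
    return classes

# Classes of width 5 starting at 13 and covering values up to 47.
print(inclusive_classes(13, 47, 5))
```

With a smallest value of 13, a largest value of 47, and a width of 5, this yields seven classes from 13-17 to 43-47.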

Histogram

A histogram is also called a block frequency diagram.
A frequency distribution can be represented in the form of graphs and charts.
A histogram represents a continuous distribution. If the class intervals are discrete (inclusive), they must be converted to continuous class boundaries before the histogram is drawn, by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
The histogram is constructed by plotting class frequency against class boundaries.
Example: The scores of thirty students in a statistics examination were given as follows:
126, 145, 137, 145, 140, 146, 131, 143, 127, 133, 134, 144, 136, 135, 128, 130, 137, 142, 141, 139, 147, 149, 150, 148, 146, 150, 148, 151, 153, 155
Use the above information to obtain the histogram of the distribution.
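| Class interval (C.I) | Class boundaries (C.B) | Upper class boundary (U.C.B) | Frequency (F) |
| --- | --- | --- | --- |
| 0-125 | 0-125.5 | 125.5 | 0 |
| 126-130 | 125.5-130.5 | 130.5 | 4 |
| 131-135 | 130.5-135.5 | 135.5 | 4 |
| 136-140 | 135.5-140.5 | 140.5 | 5 |
| 141-145 | 140.5-145.5 | 145.5 | 6 |
| 146-150 | 145.5-150.5 | 150.5 | 8 |
| 151-155 | 150.5-155.5 | 155.5 | 3 |
| 156-160 | 155.5-160.5 | 160.5 | 0 |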
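The conversion from inclusive class limits to class boundaries, and the counting of scores in each class, can be sketched in Python. This is an illustrative sketch only: the names `scores`, `class_limits`, `boundaries`, and `frequencies` are assumptions, and the empty padding classes are left out.

```python
# Examination scores of the thirty students (from the example).
scores = [126, 145, 137, 145, 140, 146, 131, 143, 127, 133,
          134, 144, 136, 135, 128, 130, 137, 142, 141, 139,
          147, 149, 150, 148, 146, 150, 148, 151, 153, 155]

# Inclusive class limits of width 5.
class_limits = [(126, 130), (131, 135), (136, 140),
                (141, 145), (146, 150), (151, 155)]

# Class boundaries: subtract 0.5 from the lower limit, add 0.5 to the upper limit.
boundaries = [(lower - 0.5, upper + 0.5) for lower, upper in class_limits]

# Frequency of each class: number of scores lying between its boundaries.
frequencies = [sum(lb <= s < ub for s in scores) for lb, ub in boundaries]

for (lb, ub), f in zip(boundaries, frequencies):
    print(f"{lb}-{ub}", f)
```

The bars of the histogram are then drawn over each pair of boundaries with heights equal to these frequencies.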

Frequency Polygon

A frequency polygon is obtained by plotting the frequency of each class against the midpoint of its class interval.
It can also be obtained by joining the mid-points of the tops of the rectangles of the histogram and extending the line to meet the x-axis.
The polygon has the same area as the corresponding histogram when the class intervals are of equal width.

Using the data above, plot the frequency polygon of the distribution.

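| Class interval (C.I) | Class boundaries (C.B) | Upper class boundary (U.C.B) | Frequency (F) | Mid-value |
| --- | --- | --- | --- | --- |
| 0-125 | 0-125.5 | 125.5 | 0 | 62.75 |
| 126-130 | 125.5-130.5 | 130.5 | 4 | 128.00 |
| 131-135 | 130.5-135.5 | 135.5 | 4 | 133.00 |
| 136-140 | 135.5-140.5 | 140.5 | 5 | 138.00 |
| 141-145 | 140.5-145.5 | 145.5 | 6 | 143.00 |
| 146-150 | 145.5-150.5 | 150.5 | 8 | 148.00 |
| 151-155 | 150.5-155.5 | 155.5 | 3 | 153.00 |
| 156-160 | 155.5-160.5 | 160.5 | 0 | 158.00 |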
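The points of the frequency polygon are the pairs (mid-value, frequency). A small sketch, assuming the six non-empty classes of the examination-score example and using illustrative names (`class_limits`, `frequencies`, `polygon_points`):

```python
# Inclusive class limits and their frequencies for the non-empty classes.
class_limits = [(126, 130), (131, 135), (136, 140),
                (141, 145), (146, 150), (151, 155)]
frequencies = [4, 4, 5, 6, 8, 3]

# Mid-value of a class = (lower limit + upper limit) / 2.
mid_values = [(lower + upper) / 2 for lower, upper in class_limits]

# Each polygon vertex pairs a mid-value with the class frequency.
polygon_points = list(zip(mid_values, frequencies))
print(polygon_points)
```

Joining these points with straight lines, and extending the ends down to the x-axis, gives the polygon.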

Cumulative Frequency Distribution (Ogive) or (C.F. Curve)

The ogive is obtained by plotting cumulative frequency against the upper class boundary. It can be used to estimate the median, quartiles, percentiles, deciles, and interquartile range.
The graph is usually S-shaped.
Using the data above, plot the cumulative frequency curve (ogive) of the distribution.
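| Class interval (C.I) | Class boundaries (C.B) | Upper class boundary (U.C.B) | Frequency (F) | Cumulative frequency (C.F) |
| --- | --- | --- | --- | --- |
| 0-125 | 0-125.5 | 125.5 | 0 | 0 |
| 126-130 | 125.5-130.5 | 130.5 | 4 | 4 |
| 131-135 | 130.5-135.5 | 135.5 | 4 | 8 |
| 136-140 | 135.5-140.5 | 140.5 | 5 | 13 |
| 141-145 | 140.5-145.5 | 145.5 | 6 | 19 |
| 146-150 | 145.5-150.5 | 150.5 | 8 | 27 |
| 151-155 | 150.5-155.5 | 155.5 | 3 | 30 |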
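The cumulative frequencies are running totals of the class frequencies. A minimal sketch using Python's `itertools.accumulate`, with illustrative variable names and the six non-empty classes of the example:

```python
from itertools import accumulate

# Upper class boundaries and frequencies of the non-empty classes.
upper_boundaries = [130.5, 135.5, 140.5, 145.5, 150.5, 155.5]
frequencies = [4, 4, 5, 6, 8, 3]

# accumulate() yields the running total after each class.
cumulative = list(accumulate(frequencies))

# The ogive is drawn through the points (upper boundary, cumulative frequency).
ogive_points = list(zip(upper_boundaries, cumulative))
print(ogive_points)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.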

Errors and Approximations

A statistical error is the difference between the actual magnitude of the object in question and the magnitude estimated by the enumerator or researcher.
For example, an investigator estimated that 4,832 people use a particular toothpaste in an area, but the actual number of people who use the toothpaste is 5,241.
- Then statistical error = actual number of people - estimated number of people:
  5241 - 4832 = 409
A statistical error is not the same as a mistake made while carrying out a calculation.

Causes of Statistical Error

Errors due to the measuring instruments used; some instruments may not be capable of giving accurate measurements.
Errors due to the researcher's inability to use the instrument correctly, or to the use of improper units of measurement.
Errors due to wrong information supplied by the respondents themselves.
Errors resulting from small samples that may not be truly representative of the population concerned.
Errors due to unnecessary approximation of the measurements of objects.

Measurement of Statistical Errors

Absolute Error:
- This is given by subtracting the estimated value from the actual value.
  Absolute \ error = Actual \ value - Estimated \ value
- Example: The estimated number of employees who will resign after 10 years of service is 15. The actual number who resign is 17.
  A.E = 17 - 15 = 2
Relative Error (or percentage error):
- This is the actual error committed (absolute error) divided by the estimated value. When this proportion is multiplied by 100, it becomes the percentage error.
  R.E = \frac{Absolute \ error}{Estimated \ value}
  P.E = \frac{A.E}{E.V} \times 100\%
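The three error measures can be sketched in Python using the definitions given in this section (relative error taken with respect to the estimated value). The function names are hypothetical and chosen only for this illustration.

```python
def absolute_error(actual, estimated):
    """Absolute error = actual value - estimated value."""
    return actual - estimated

def relative_error(actual, estimated):
    """Relative error = absolute error / estimated value."""
    return absolute_error(actual, estimated) / estimated

def percentage_error(actual, estimated):
    """Percentage error = relative error * 100."""
    return relative_error(actual, estimated) * 100

# Toothpaste example: actual 5241 users, estimated 4832 users.
print(absolute_error(5241, 4832))             # 409

# Resignation example: actual 17, estimated 15.
print(absolute_error(17, 15))                 # 2
print(round(percentage_error(17, 15), 2))     # 13.33
```

For the resignation example the relative error is 2/15, or about 13.33% as a percentage error.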

Measures of Central Tendency

Measures of central tendency, also called measures of location or simply averages, are widely used statistical measures.
A measure of central tendency locates a central value around which the other values in the distribution tend to cluster.
Such a value is important because, once determined, it can be taken as representative of the group.
The five measures of central tendency are:
1. Arithmetic mean or simple mean.
2. Median
3. Mode
4. Geometric mean
5. Harmonic mean

Arithmetic Mean

Defined as the sum of the observations divided by the number of observations.
If the variable X assumes values X_1, X_2, X_3, \dots, X_n, then the mean \bar{X} is given by:
\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}
- This formula is for the ungrouped or raw data.
- Example: Calculate the mean for 2, 4, 6, 8, 10
  \bar{X} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6

Grouped Data

The mean for grouped data is obtained from the following formula:
\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}
- Where:
  - x_i = midpoint of individual class
  - f_i = the frequency of the Individual class
  - \sum f_i = the sum of the frequencies or total frequencies
Example: Given the following frequency distribution, calculate the arithmetic mean:
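| Mark (x) | No. of students (f) | fx |
| --- | --- | --- |
| 64 | 8 | 512 |
| 63 | 18 | 1134 |
| 62 | 12 | 744 |
| 61 | 9 | 549 |
| 60 | 7 | 420 |
| 59 | 6 | 354 |
| Total | 60 | 3713 |
\bar{X} = \frac{\sum fx}{N} = \frac{3713}{60} = 61.88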
Example: Calculate the arithmetic mean of the marks from the following table:
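| Marks | Midpoint (x) | Number of Students (f) | fx |
| --- | --- | --- | --- |
| 0-10 | 5 | 12 | 60 |
| 10-20 | 15 | 18 | 270 |
| 20-30 | 25 | 27 | 675 |
| 30-40 | 35 | 20 | 700 |
| 40-50 | 45 | 17 | 765 |
| Total | | 94 | 2470 |
\bar{X} = \frac{\sum fx}{N} = \frac{2470}{94} = 26.28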
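Both worked examples can be reproduced with a short Python sketch of the grouped-mean formula. The function name `grouped_mean` is an assumption made for this illustration; for class intervals, the class midpoints are passed as the values.

```python
def grouped_mean(values, frequencies):
    """Mean of frequency data: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, frequencies))
    return total_fx / sum(frequencies)

# First example: individual marks with their frequencies.
marks = [64, 63, 62, 61, 60, 59]
students = [8, 18, 12, 9, 7, 6]
print(round(grouped_mean(marks, students), 2))      # 61.88

# Second example: class midpoints of 0-10, 10-20, ..., 40-50.
midpoints = [5, 15, 25, 35, 45]
counts = [12, 18, 27, 20, 17]
print(round(grouped_mean(midpoints, counts), 2))    # 26.28
```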

Median

The median is that value of the variable that divides the group into two equal parts, one part comprising all values greater, and the other, all values less than the median.

Ungrouped or Raw Data

Arrange the given values in increasing or decreasing order. If the number of values is odd, the median is the middle value. If the number of values is even, the median is the average (mean) of the two middle values.
- Example: Find the median of 2, 1, 4, 3, 6, 5
  - Solution: 1, 2, 3, 4, 5, 6
    Median = \frac{3 + 4}{2} = 3.5
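The rule for raw data can be written as a small Python function. This is a sketch for illustration; the function name `median` is chosen here, and Python's standard library offers the equivalent `statistics.median`.

```python
def median(values):
    """Median of raw data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[middle]
    # Even number of values: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([2, 1, 4, 3, 6, 5]))  # 3.5
```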

Median of Discrete Frequency Distribution

In the case of a discrete frequency distribution, the median is obtained by considering the cumulative frequencies. The steps for calculating the median are given below:
1. Find \frac{N}{2}, where N = \sum f
2. Locate the cumulative frequency (C.F) just greater than \frac{N}{2}
3. The corresponding value of x is the median.
Example: Obtain the median for the following frequency distribution:
| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| :- | - | - | - | -- | -- | -- | -- | -- | -- | -- | -- |
| f | 3 | 4 | 8 | 10 | 16 | 20 | 25 | 15 | 9 | 6 | 4 |
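- Solution: N = 120, \frac{N}{2} = 60
  - The cumulative frequencies are 3, 7, 15, 25, 41, 61, 86, 101, 110, 116, 120. The cumulative frequency just greater than 60 is 61, and the corresponding value of x is 6. Therefore, the median is 6.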
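The three steps can be sketched in Python as follows. The function name `discrete_median` is hypothetical, and the sketch assumes the x values are listed in increasing order.

```python
from itertools import accumulate

def discrete_median(values, frequencies):
    """Return the x whose cumulative frequency is the first to exceed N/2.

    Assumes `values` is sorted in increasing order.
    """
    half = sum(frequencies) / 2
    for x, cf in zip(values, accumulate(frequencies)):
        if cf > half:
            return x

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
f = [3, 4, 8, 10, 16, 20, 25, 15, 9, 6, 4]

print(list(accumulate(f)))    # cumulative frequencies
print(discrete_median(x, f))  # 6
```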

Median for Grouped Data

For grouped data, the median is defined as:
Median = L_m + (\frac{\frac{n}{2} - cf_{bm}}{f_m}) \times c
- Where:
  - L_m = lower class boundary of the median class
  - cf_{bm} = Cumulative frequency before the median class
  - f_m = Frequency of the median class
  - c = class size or width
  - \frac{n}{2} = The median position (to help identify the median class)
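Example: The errors discovered in the lengths of rods produced in a factory (in millimetres) are given below:
| Errors in length (mm) | 19-21 | 22-24 | 25-27 | 28-30 | 31-33 | 34-36 | 37-39 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Number of rods | 9 | 12 | 18 | 23 | 19 | 13 | 6 |
Estimate the median error in the length of rods.
Solution: n = 100, so \frac{n}{2} = 50. The cumulative frequencies are 9, 21, 39, 62, 81, 94, 100, so the median class is 28-30, with L_m = 27.5, cf_{bm} = 39, f_m = 23, and c = 3.
Median = L_m + (\frac{\frac{n}{2} - cf_{bm}}{f_m}) \times c = 27.5 + (\frac{50 - 39}{23}) \times 3 = 27.5 + 1.43 = 28.93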
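The grouped-median formula can be checked with a short Python sketch. The function name `grouped_median` and its argument names are assumptions made for this illustration; the lower class boundaries are the lower class limits minus 0.5.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median = L_m + ((n/2 - cf_bm) / f_m) * c for grouped data."""
    half = sum(frequencies) / 2
    cf_before = 0  # cumulative frequency before the current class
    for boundary, f in zip(lower_boundaries, frequencies):
        if cf_before + f >= half:
            # This is the median class.
            return boundary + (half - cf_before) / f * width
        cf_before += f

# Rod-length example: classes 19-21, 22-24, ..., 37-39 (width 3).
lower_boundaries = [18.5, 21.5, 24.5, 27.5, 30.5, 33.5, 36.5]
rods = [9, 12, 18, 23, 19, 13, 6]

print(round(grouped_median(lower_boundaries, rods, 3), 2))  # 28.93
```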

Mode

For ungrouped data, the mode is the value that occurs most frequently.
Although the mode is easy to understand, it may not be unique or clearly defined, as some distributions have more than one mode.
A distribution with one mode is called a unimodal distribution; a distribution with two modes is called a bimodal distribution; and a distribution with more than two modes is referred to as a multimodal distribution.
Example:
Find the mode of the following distribution:
14, 19, 16, 21, 18, 24, 15 and 19
- The mode is 19 (unimodal distribution)
Mode from Frequency Data: In frequency data, the mode is the number with the highest frequency.
Example:
Find the mode of the distribution x: 1, 2, 3, 4, 5, 6, 7, 8, 9 and frequency f: 4, 9, 16, 25, 12, 15, 7, 3, 1
- The maximum frequency is 25, and the corresponding value of x is 4. Therefore, the mode is 4.
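Both ways of finding the mode can be sketched in Python. The variable names are illustrative; `statistics.multimode` (Python 3.8 or later) returns a list, so it also covers distributions with more than one mode.

```python
from statistics import multimode

# Mode of raw data: the most frequently occurring value(s).
raw = [14, 19, 16, 21, 18, 24, 15, 19]
print(multimode(raw))  # [19]

# Mode from frequency data: the x value(s) with the highest frequency.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
f = [4, 9, 16, 25, 12, 15, 7, 3, 1]
highest = max(f)
modes = [value for value, freq in zip(x, f) if freq == highest]
print(modes)  # [4]
```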

Mode for Grouped data

When data are grouped, the mode can be obtained using the following formula:
Mode = L_m + (\frac{\Delta_1}{\Delta_1 + \Delta_2}) \times c
- Where:
  - L_m = lower class boundary of the modal class
  - \Delta_1 = the difference between modal class frequency and the frequency of the next upper class