Chapter 2: Descriptive Statistics: Data Organization and Visualization

Introduction to Descriptive Statistics

  • Objective of Descriptive Statistical Methods: The primary aim is to present information and data in a manner that is clear, concise, and accurate.

  • The Problem of Data Volume: Analyzing large data sets is difficult because there is often too much information for the human mind to assimilate or understand.

  • Task of Descriptive Methods: These methods summarize information and extract main features without distorting the overall picture.

Organizing and Graphing Data

  • Samples and Populations: All data sets used in this context are regarded as samples drawn from a specific population.

  • Purpose of Samples: A sample is studied mainly to obtain information about the larger population.

  • Focus of Study: The main goal involves summarizing and describing specific features of the data set.

  • Raw Data (Ungrouped Data):

    • Definition: Information obtained from each member of a population or sample and recorded in the sequence in which it becomes available.

    • Characteristics: Collected at random; not organized or ranked.

Frequency Distributions

  • Definition: A frequency distribution is a table where data are grouped into classes, and the number of values (frequencies) falling in each class is recorded.

  • Purpose: To show how the frequencies are spread across the classes; the term "frequency distribution" refers to this pattern.

Example 1: Survey of Urban Neighborhood Families
  • Context: A survey of 40 families recorded the number of children per family.

  • Sample Size (n): 40

  • Raw Data: 1, 0, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5, 3, 3, 2, 4, 2, 2, 3, 0, 2, 1, 4, 5, 3, 3, 4, 4, 1, 2, 4, 5

  • Frequency Distribution Table Results:

    • 0 children: Frequency (f) = 3; Relative Frequency (Rf) = 3/40 = 0.075

    • 1 child: Frequency (f) = 7; Relative Frequency (Rf) = 7/40 = 0.175

    • 2 children: Frequency (f) = 10; Relative Frequency (Rf) = 10/40 = 0.25

    • 3 children: Frequency (f) = 8; Relative Frequency (Rf) = 8/40 = 0.2

    • 4 children: Frequency (f) = 6; Relative Frequency (Rf) = 6/40 = 0.15

    • 5 children: Frequency (f) = 4; Relative Frequency (Rf) = 4/40 = 0.1

    • 6 children: Frequency (f) = 2; Relative Frequency (Rf) = 2/40 = 0.05

    • Total Frequency: $\sum f = n = 40$

Relative Frequency Calculations
  • Definition: The proportion (or percentage) of data falling into a specific class.

  • Formula: $Rf = \frac{\text{Class Frequency}}{\text{Sample Size}} = \frac{f}{n}$

  • Note: The sum of frequencies always equals the sample size (n).
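The counting behind Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `frequency_table` is hypothetical, and the data are the 40 raw values listed in Example 1.

```python
from collections import Counter

# Raw data from Example 1: number of children in each of 40 families.
children = [1, 0, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5,
            3, 3, 2, 4, 2, 2, 3, 0, 2, 1, 4, 5, 3, 3, 4, 4, 1, 2, 4, 5]


def frequency_table(values):
    """Return {class: (f, Rf)} with classes in ascending order.

    Hypothetical helper: f is the class frequency, Rf = f / n.
    """
    n = len(values)
    counts = Counter(values)
    return {value: (counts[value], counts[value] / n)
            for value in sorted(counts)}


table = frequency_table(children)
for value, (f, rf) in table.items():
    print(f"{value} children: f = {f}, Rf = {rf:.3f}")

# The frequencies should add up to the sample size n.
total_frequency = sum(f for f, _ in table.values())
print("Total frequency:", total_frequency)
```

The final check mirrors the note above: the class frequencies sum to n, so the relative frequencies sum to 1.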

Example 2: Smartphone Ownership (Qualitative Data)
  • Context: Type of smartphone owned by the youngest family member.

  • Categories: A = Android, I = iPhone, W = Windows phone.

  • Raw Data: A, W, I, I, W, I, A, A, A, I, W, A, W, A, I, A, I, I, A, W, W, I, A, A, I, W, A, I, A, I, W, I, I, W, A, I, A, A, A, W

  • Frequency Distribution Results:

    • Android: Frequency = 16; Rf = 0.4; Cumulative Frequency (F) = 16

    • iPhone: Frequency = 14; Rf = 0.35; Cumulative Frequency (F) = 30

    • Windows phone: Frequency = 10; Rf = 0.25; Cumulative Frequency (F) = 40
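The same tally works for qualitative data. The sketch below recomputes the Example 2 table, including the cumulative frequency F, and keeps the categories in the order used above. The variable names are illustrative assumptions, not part of the original notes.

```python
from collections import Counter
from itertools import accumulate

# Raw data from Example 2, using the category codes A, I, W.
raw = ("A,W,I,I,W,I,A,A,A,I,W,A,W,A,I,A,I,I,A,W,"
       "W,I,A,A,I,W,A,I,A,I,W,I,I,W,A,I,A,A,A,W")
phones = raw.split(",")

labels = {"A": "Android", "I": "iPhone", "W": "Windows phone"}
order = ["A", "I", "W"]          # same order as the table in the text

n = len(phones)
counts = Counter(phones)
freqs = [counts[code] for code in order]
cumulative = list(accumulate(freqs))   # running total of frequencies

for code, f, F in zip(order, freqs, cumulative):
    print(f"{labels[code]}: f = {f}, Rf = {f / n:.2f}, F = {F}")
```

The last cumulative frequency equals n, which is a quick check that no observation was missed.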

Constructing Frequency Distributions for Grouped Data

Case Study: Staff Lunch Spending at DUT
  • Data Set: Amounts spent in Rands by 40 lecturers.

  • Values: 38, 64, 50, 32, 44, 25, 49, 57, 46, 58, 40, 47, 36, 48, 52, 44, 68, 26, 38, 78, 63, 21, 54, 65, 46, 73, 42, 47, 35, 53, 40, 35, 61, 45, 35, 42, 50, 56, 45, 28

  • Statistics: Highest value = 78, Lowest value = 21, n = 40.

Major Steps for Construction:
  1. Calculate Number of Classes (k):

    • Formula: $k = 3.3\log_{10}(n) + 1$

    • Calculation: $k = 3.3\log_{10}(40) + 1 = 6.2868 \approx 6$

  2. Calculate Class Width (W):

    • Formula: $W = \frac{\text{Range}}{k}$

    • Range (R): $78 - 21 = 57$

    • Calculation: $W = \frac{57}{6} = 9.5 \approx 10$

  3. Choose Starting Point:

    • Rule: Use any convenient number equal to or less than the smallest value in the data set as the lower limit of the first class (20 was chosen for this example).
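The first two steps can be checked numerically. In the sketch below the function names are hypothetical, the logarithm is taken to base 10 (which reproduces the 6.2868 in the calculation above), and the class width is rounded up so that k classes cover the whole range; rounding up is an assumption about the intended rule, consistent with 9.5 becoming 10.

```python
import math


def number_of_classes(n):
    """k = 3.3 * log10(n) + 1, rounded to the nearest whole number."""
    return round(3.3 * math.log10(n) + 1)


def class_width(lowest, highest, k):
    """Range divided by k, rounded up so the classes cover all values."""
    return math.ceil((highest - lowest) / k)


n, lowest, highest = 40, 21, 78     # DUT lunch spending case study
k = number_of_classes(n)
width = class_width(lowest, highest, k)
print(f"k = {k}, W = {width}")

# Starting at 20 with k classes of this width must reach past the maximum.
start = 20
assert start <= lowest and start + k * width > highest
```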

Grouped Frequency Distribution Table:
  • Classes:

    • 20 - < 30: f = 4, Rf = 4/40, F = 4, Midpoint (x) = 25

    • 30 - < 40: f = 7, Rf = 7/40, F = 11, Midpoint (x) = 35

    • 40 - < 50: f = 14, Rf = 14/40, F = 25, Midpoint (x) = 45

    • 50 - < 60: f = 8, Rf = 8/40, F = 33, Midpoint (x) = 55

    • 60 - < 70: f = 5, Rf = 5/40, F = 38, Midpoint (x) = 65

    • 70 - < 80: f = 2, Rf = 2/40, F = 40, Midpoint (x) = 75

  • Percentage Frequency: Calculated as $Rf \times 100$.

  • Cumulative Frequency (F): The frequency of a class plus all previous frequencies. For the first class, F always equals the frequency (f).

  • Midpoint (Class Mark): The average of the two class limits.

    • Formula: $\text{Midpoint} = \frac{\text{lower class limit} + \text{upper class limit}}{2}$

    • Example: $M_1 = \frac{20 + 30}{2} = 25$
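The grouped table can be rebuilt directly from the 40 lunch-spending values. The following is a minimal sketch, assuming classes of the form lower ≤ value < upper as in the table; `grouped_table` and its dictionary keys are hypothetical names chosen for illustration.

```python
# Amounts spent on lunch (Rands) by 40 lecturers, from the case study.
spending = [38, 64, 50, 32, 44, 25, 49, 57, 46, 58, 40, 47, 36, 48, 52, 44,
            68, 26, 38, 78, 63, 21, 54, 65, 46, 73, 42, 47, 35, 53, 40, 35,
            61, 45, 35, 42, 50, 56, 45, 28]


def grouped_table(values, start, width, k):
    """Build one row per class: limits, f, Rf, percentage, F and midpoint."""
    n = len(values)
    rows = []
    running_total = 0
    for i in range(k):
        lower = start + i * width
        upper = lower + width
        # A value belongs to the class when lower <= value < upper.
        f = sum(1 for v in values if lower <= v < upper)
        running_total += f
        rows.append({
            "lower": lower,
            "upper": upper,
            "f": f,
            "Rf": f / n,
            "percent": 100 * f / n,
            "F": running_total,
            "midpoint": (lower + upper) / 2,
        })
    return rows


rows = grouped_table(spending, start=20, width=10, k=6)
for r in rows:
    print(f"{r['lower']} - < {r['upper']}: f = {r['f']}, "
          f"Rf = {r['f']}/{len(spending)}, F = {r['F']}, x = {r['midpoint']:.0f}")
```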

Exercises for Practice

Exercise 1: Call Center Performance
  • Metric: Service level (time in seconds to answer calls).

  • Data (n = 50): 16, 14, 16, 19, 6, 14, 15, 5, 16, 18, 17, 22, 6, 18, 10, 15, 12, 6, 19, 16, 16, 15, 13, 25, 9, 17, 12, 10, 5, 15, 23, 11, 12, 14, 24, 9, 10, 13, 14, 26, 19, 20, 13, 24, 28, 15, 21, 8, 16, 12

  • Task: Construct a frequency distribution and a cumulative frequency distribution.

Exercise 2: Public Transport Spending
  • Data: Amount spent per day by 50 DUT staff members.

  • Statistics: Highest = R64, Lowest = R39.

  • Data Set: 57, 39, 52, 52, 43, 50, 53, 42, 58, 55, 58, 50, 53, 50, 49, 45, 49, 51, 44, 54, 49, 57, 55, 64, 45, 50, 45, 51, 54, 58, 53, 49, 52, 51, 41, 52, 40, 44, 49, 45, 43, 47, 47, 43, 51, 55, 55, 46, 54, 41

  • Task: Construct a frequency distribution and a cumulative frequency distribution.

Graphing Grouped Data

Histogram
  • Definition: Graphical representation of a frequency distribution.

  • Construction:

    • Horizontal axis: Class boundaries (true class limits ensuring continuity).

    • Vertical axis: Frequencies, relative frequencies, or percentages.

    • Representation: Rectangular bars with class boundaries as the base and frequency as the height.

    • Gaps: There are no gaps between bars because they are drawn over class boundaries.

  • Insight: From the DUT lunch spending histogram, it is visible that most staff members spent between 40 and 50 Rand.
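A rough text rendering is enough to see the shape that the histogram conveys. The sketch below draws one row of `#` marks per class, using the class frequencies from the grouped table; it is a stand-in for a plotted histogram, not a replacement, and the helper name `text_histogram` is hypothetical.

```python
# (lower boundary, upper boundary, frequency) for the lunch-spending classes.
classes = [(20, 30, 4), (30, 40, 7), (40, 50, 14),
           (50, 60, 8), (60, 70, 5), (70, 80, 2)]


def text_histogram(class_rows):
    """Return one line per class; bar length equals the class frequency."""
    lines = []
    for lower, upper, f in class_rows:
        lines.append(f"{lower:>2} - <{upper:<2} | {'#' * f} {f}")
    return lines


histogram_lines = text_histogram(classes)
print("\n".join(histogram_lines))

# The class with the longest bar is the modal class.
modal_class = max(classes, key=lambda row: row[2])
print(f"Modal class: {modal_class[0]} - < {modal_class[1]}")
```

The longest bar falls on the 40 - < 50 class, which is the insight stated above.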

Shape of a Distribution
  • Purpose: Histograms describe the clustering pattern of data values.

  • Normal Distribution: Bell-shaped; $\text{Mean} = \text{Median} = \text{Mode}$.

  • Right Skewed: Tail extends to the right; $\text{Mode} < \text{Median} < \text{Mean}$.

  • Left Skewed: Tail extends to the left; $\text{Mean} < \text{Median} < \text{Mode}$.

  • Central Location Rules:

    • If data is normally distributed, use the Mean.

    • If data is not normally distributed, use the Median.
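The ordering of mean and median gives a quick, informal hint about skewness. The sketch below applies it to the lunch-spending data. It is a crude heuristic for illustration only: the function name and the `tolerance` parameter are assumptions, and a histogram remains the better guide to shape.

```python
import statistics

# Lunch-spending data (Rands) from the DUT case study.
spending = [38, 64, 50, 32, 44, 25, 49, 57, 46, 58, 40, 47, 36, 48, 52, 44,
            68, 26, 38, 78, 63, 21, 54, 65, 46, 73, 42, 47, 35, 53, 40, 35,
            61, 45, 35, 42, 50, 56, 45, 28]


def compare_mean_median(values, tolerance=0.0):
    """Return (mean, median, label) based on how mean and median are ordered.

    tolerance is a hypothetical cut-off: differences no larger than this
    are treated as "roughly symmetric".
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tolerance:
        label = "roughly symmetric"
    elif mean > median:
        label = "right skewed"      # Median < Mean
    else:
        label = "left skewed"       # Mean < Median
    return mean, median, label


mean, median, label = compare_mean_median(spending)
print(f"mean = {mean:.1f}, median = {median:.1f}, shape hint: {label}")
```

For these data the mean (46.9) lies slightly above the median (46), which points to a mild right skew.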

Frequency Polygon
  • Definition: A line graph emphasizing continuous change in frequencies.

  • Construction:

    • Plot class midpoints against frequencies.

    • Add two additional classes (one at each end) with zero frequency to anchor the graph to the horizontal axis.

    • Join adjacent points with straight lines.

  • Equivalence: The histogram and frequency polygon are equivalent; with equal class widths, the area under each is proportional to the total number of observations (n).
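The construction steps and the area claim can both be checked with the lunch-spending classes. The sketch below builds the polygon points, including the two zero-frequency anchor classes, and compares the area under the polygon (trapezoid rule) with the total area of the histogram bars. With equal class widths W, both should come to n × W. Variable names are illustrative.

```python
width = 10
midpoints = [25, 35, 45, 55, 65, 75]
freqs = [4, 7, 14, 8, 5, 2]

# Anchor the polygon with one empty class at each end.
points = ([(midpoints[0] - width, 0)]
          + list(zip(midpoints, freqs))
          + [(midpoints[-1] + width, 0)])

# Histogram area: each bar has base = class width and height = frequency.
histogram_area = sum(f * width for f in freqs)

# Polygon area: sum of trapezoids between consecutive points.
polygon_area = sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

print("Polygon points:", points)
print("Histogram area:", histogram_area)
print("Polygon area:  ", polygon_area)
```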

Cumulative Frequency Graph (Ogive)
  • Definition: A graph of cumulative frequencies versus upper-class boundaries.

  • Construction:

    • Horizontal axis: Upper class boundary.

    • Vertical axis: Cumulative frequency (F).

    • Starting Point: Plot 0 against the lower-class boundary of the first class.

    • Example coordinates: (20, 0), (30, 4), (40, 11), (50, 25), (60, 33), (70, 38), (80, 40).
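The ogive coordinates follow mechanically from the class frequencies. A minimal sketch, assuming the lunch-spending classes (first lower boundary 20, width 10); the variable names are illustrative.

```python
from itertools import accumulate

first_lower_boundary = 20
width = 10
freqs = [4, 7, 14, 8, 5, 2]

# Upper class boundaries: 30, 40, ..., 80.
upper_boundaries = [first_lower_boundary + width * (i + 1)
                    for i in range(len(freqs))]

# Start at (lower boundary of first class, 0), then pair each upper
# boundary with the cumulative frequency up to that boundary.
ogive_points = ([(first_lower_boundary, 0)]
                + list(zip(upper_boundaries, accumulate(freqs))))

print(ogive_points)
```

Each point (b, F) reads as "F observations are below b"; for instance, 25 of the 40 lecturers spent less than R50.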

Graphing Qualitative Data Sets

Bar Charts
  • Definition: Categorical data represented by rectangular bars.

  • Features: Bars can be vertical or horizontal. Height/length is proportional to the size of the category. Only totals are represented.

  • Example (Facebook Users, US 2011):

    • 13 - 25 Age Group: 65,082,280 users (45%)

    • 26 - 44 Age Group: 53,300,200 users (36%)

    • 45 - 64 Age Group: 27,885,100 users (19%)

    • Total users: 146,267,580

  • Example (Commerce Student Distinctions):

    • 0 As: 2 students; 1 A: 6 students; 2 As: 9 students; 3 As: 4 students; 4 As: 3 students.

Pie Chart
  • Definition: A circle divided into slices proportional to the frequency of subgroups.

  • Calculations:

    • Angle: $\text{Angle} = \frac{\text{Number of items in category}}{\text{Total number}} \times 360^\circ$

    • Percentage: $\text{Percentage} = \frac{\text{Angle}}{360^\circ} \times 100$

  • Results for Commerce Students (Total n = 24):

    • 0 As: 30° (8.3%)

    • 1 A: 90° (25.0%)

    • 2 As: 135° (37.5%)

    • 3 As: 60° (16.7%)

    • 4 As: 45° (12.5%)

  • Results for Facebook Age Groups:

    • 13 - 25: $\frac{45}{100} \times 360 = 162^\circ$

    • 26 - 44: $\frac{36}{100} \times 360 = 129.6^\circ$

    • 45 - 64: $\frac{19}{100} \times 360 = 68.4^\circ$
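Both sets of pie chart results can be reproduced from the formulas above. The sketch below computes slice angles and percentages from category counts (commerce students) and converts given percentages to angles (Facebook age groups). The function names are hypothetical.

```python
def pie_slices(counts):
    """Return {label: (angle in degrees, percentage)} from category counts."""
    total = sum(counts.values())
    return {label: (count * 360 / total, count * 100 / total)
            for label, count in counts.items()}


def angle_from_percent(percent):
    """Convert a percentage share into a slice angle in degrees."""
    return percent * 360 / 100


# Commerce student distinctions (n = 24).
distinctions = {"0 As": 2, "1 A": 6, "2 As": 9, "3 As": 4, "4 As": 3}
slices = pie_slices(distinctions)
for label, (angle, percent) in slices.items():
    print(f"{label}: {angle:.0f} degrees ({percent:.1f}%)")

# Facebook age groups, using the percentages given in the text.
facebook_percent = {"13 - 25": 45, "26 - 44": 36, "45 - 64": 19}
facebook_angles = {group: angle_from_percent(p)
                   for group, p in facebook_percent.items()}
for group, angle in facebook_angles.items():
    print(f"{group}: {angle:.1f} degrees")
```

In both cases the angles add up to 360°, which is a useful check before drawing the chart.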