COMM 1503 Types of Data

Introduction to Data and Context

  • Data: Refers to the information collected for analysis.

    • Population Data: Data that represents an entire population.

    • Sample Data: Data drawn from a subset of the population.

  • Types of Data:

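    • Survey, Census, Parameter: A survey typically collects data from a sample, a census collects data from every member of the population, and a parameter is a numerical summary that describes the population.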

    • Categorical Data: Data representing categories or distinct groups (e.g., types of drinks).

    • Quantitative Data: Numerical data that can be measured (e.g., heights, ages).

  • Objective of the Class:

    • Discuss the utility and summarization of data provided.

    • Explore frequency distributions and their importance in analyzing data.

Frequency Distribution

  • Definition of Frequency Distribution:

    • A summary that shows all possible values in a dataset and their respective counts.

    • Helps to identify how often each value appears.

  • Goals of a Frequency Distribution:

    • Present every value that occurs in the dataset.

    • Count how many times each value appears in the data.

  • Types of Frequency Distribution:

    • For both numerical and categorical data, the frequency distribution is typically the first step in summarizing data.

  • Example of Frequency Distribution:

    • If analyzing drinks sold, the distribution would show:

    • What types of drinks were sold

    • How many times each kind was sold.
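The drinks example can be sketched in a few lines of Python. This is only an illustration: the list `drinks_sold` and its contents are made-up values, not data from the class.

```python
from collections import Counter

# Hypothetical sales records: one entry per drink sold (illustrative data only).
drinks_sold = ["Coffee", "Tea", "Coffee", "Juice", "Tea", "Coffee", "Water"]

# Counter tallies how many times each distinct value appears.
frequency = Counter(drinks_sold)

# List each kind of drink with its count, most frequent first.
for drink, count in frequency.most_common():
    print(f"{drink}: {count}")
```

The keys of `frequency` answer "what types of drinks were sold" and the values answer "how many times each kind was sold".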

Calculating Frequency Distributions

  • Methods for Calculation:

    • Calculations can be performed by hand or using Excel. Basic counting suffices for manual calculations.

  • Using Excel for Frequency Distribution:

    • Function COUNTIF: Counts the number of cells that meet a specific criterion.

    =COUNTIF(A2:B26, criteria)
    
    • Use dollar signs ($) to make a reference absolute (e.g., $A$2:$B$26) so that the range stays fixed when the formula is dragged down rows or across columns.

  • Total Count and Frequency Analysis:

    • The total frequency count should match the number of observations in the dataset, e.g., total drinks sold = 50.

  • Importance of Accuracy:

    • Verify that the number of counts aligns with the actual data collected to maintain data integrity.
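For readers working outside Excel, the same counting and checking steps might look like the sketch below. The helper `countif` and the sample `cells` list are hypothetical names and values introduced here for illustration; the helper only mimics an exact-match criterion and, unlike Excel's COUNTIF, compares text case-sensitively.

```python
def countif(cells, criterion):
    """Count the cells equal to `criterion` (exact, case-sensitive match)."""
    return sum(1 for cell in cells if cell == criterion)

# Hypothetical data standing in for the worksheet range (illustrative only).
cells = ["Coffee", "Tea", "Coffee", "Juice", "Tea", "Coffee", "Water", "Juice"]

# One count per distinct value, like one COUNTIF formula per category row.
categories = sorted(set(cells))
table = {category: countif(cells, category) for category in categories}

# Integrity check: the frequencies must add up to the number of observations.
total = sum(table.values())
assert total == len(cells), "frequency total does not match the number of observations"
print(table, total)
```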

Relative Frequency and Percent Frequency

  • Relative Frequency:

    • Represents the proportion of a total that a particular value contributes.

    • Calculated as:
      \[ \text{Relative Frequency} = \frac{\text{Frequency}}{\text{Total Frequency}} \]

  • Percent Frequency:

    • Similar to relative frequency but expressed as a percentage.

    • Calculated as:
      \[ \text{Percent Frequency} = \text{Relative Frequency} \times 100 \]

  • Summation of Frequencies:

    • Absolute frequencies should sum to the total number of observations, relative frequencies should always sum to 1, and percent frequencies should sum to 100.
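The two formulas and the sum check can be worked through in a short Python sketch. The counts in `frequency` are invented for illustration and were chosen only so that they total 50.

```python
# Hypothetical frequency table (illustrative counts that total 50).
frequency = {"Coffee": 20, "Tea": 15, "Juice": 10, "Water": 5}

total = sum(frequency.values())

# Relative frequency = frequency / total frequency.
relative = {value: count / total for value, count in frequency.items()}

# Percent frequency = relative frequency x 100.
percent = {value: rel * 100 for value, rel in relative.items()}

for value in frequency:
    print(f"{value}: {frequency[value]}  {relative[value]:.2f}  {percent[value]:.0f}%")
```

Summing `relative.values()` gives 1 (up to floating-point rounding), which is a quick way to catch a miscounted category.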

Analyzing Quantitative Data

  • Defining Ranges for Bins:

    • Rather than focusing on exact quantities (e.g., exact heights), the objective is often to classify data into ranges (bins).

  • Choosing the Number of Bins:

    • The number of bins (typically between 5 and 20) is a stylistic choice that depends on the analyst's judgment.

  • Calculating Bin Width:

    • Formula:
      \[ \text{Bin Width} = \frac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Bins}} \]

    • Example: If the largest height is 73 inches and the smallest is 58 inches, the width for 5 bins would be:
      \[ \frac{73 - 58}{5} = 3 \]
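The bin width formula is simple enough to check by hand, but a small Python sketch makes the arithmetic explicit. The function name `bin_width` is a hypothetical helper introduced here.

```python
def bin_width(largest, smallest, number_of_bins):
    """Bin width = (largest value - smallest value) / number of bins."""
    return (largest - smallest) / number_of_bins

# Heights example from the notes: largest 73 inches, smallest 58 inches, 5 bins.
width = bin_width(73, 58, 5)
print(width)  # 3.0
```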

  • Understanding Edge Cases in Binning:

    • Decisions must be made consistently about including or excluding boundary values when establishing bins to ensure non-overlapping categories.

    • Use square brackets [ ] for inclusive limits and round brackets ( ) for exclusive limits; for example, [58, 61) includes 58 but excludes 61.
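One consistent way to handle boundary values is sketched below, using the 58 to 73 inch range with 5 bins of width 3. The heights are invented for illustration, `assign_bin` is a hypothetical helper, and closing the last bin on both ends is an assumption made here so that the largest value has somewhere to go; other conventions are equally valid as long as they are applied consistently.

```python
def assign_bin(value, smallest, width, number_of_bins):
    """Return the 0-based index of the bin containing `value`.

    Bins are [lower, upper): the lower limit is included, the upper excluded.
    Assumption: the last bin is closed on both ends so the largest value fits.
    Values are assumed to lie between the smallest and largest observations.
    """
    index = int((value - smallest) // width)
    return min(index, number_of_bins - 1)

# Hypothetical heights in inches (illustrative only).
heights = [58, 60, 61, 63.5, 64, 66, 67, 69, 70, 73]

counts = [0] * 5
for h in heights:
    counts[assign_bin(h, smallest=58, width=3, number_of_bins=5)] += 1

# Print each bin in bracket notation with its frequency.
for i, count in enumerate(counts):
    lower = 58 + 3 * i
    upper = lower + 3
    closing = "]" if i == 4 else ")"
    print(f"[{lower}, {upper}{closing}: {count}")
```

Note that a height of exactly 61 lands in [61, 64), not [58, 61), so no observation is counted twice.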

Histograms and Their Interpretation

  • Definition of Histogram:

    • A visual representation of a frequency distribution in which the horizontal axis (x-axis) shows the bins and the vertical axis (y-axis) shows frequency, relative frequency, or percent frequency.

    • Unlike bar charts used for categorical data, histograms represent continuous data without spaces between bars, indicating data continuity.
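As a rough illustration of how binned frequencies turn into a picture, the sketch below prints a text histogram. The bin labels and frequencies are invented, `text_histogram` is a hypothetical helper, and the chart is drawn sideways (bins down the page, frequency across) only because that is easier to print; a real histogram would normally put the bins on the x-axis.

```python
# Hypothetical bin labels and frequencies (illustrative only).
bins = ["[58, 61)", "[61, 64)", "[64, 67)", "[67, 70)", "[70, 73]"]
frequencies = [2, 5, 8, 4, 1]

def text_histogram(labels, counts):
    """Build one line per bin, with one '#' per observation."""
    return [f"{label:>9} | {'#' * count}" for label, count in zip(labels, counts)]

for line in text_histogram(bins, frequencies):
    print(line)
```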

  • Shape of Histograms:

    • Analysis of shape is crucial; look for symmetry or skewness.

    • Definitions:

    • Symmetrical Distribution: Both sides of the histogram mirror each other.

    • Skewed Distribution: One side is longer than the other:

      • Right-Skew: Most data is clustered on the left, tail to the right.

      • Left-Skew: Most data is clustered on the right, tail to the left.

Cumulative Frequency Distribution

  • Cumulative Distribution Explained:

    • Represents the accumulation of frequencies, tallying totals up to specific bin limits.

    • Cumulative frequency is calculated by adding each bin's frequency to the cumulative total of the previous bins.

  • Applications of Cumulative Frequency:

    • Particularly useful for determining how many observations fall below a certain value without concern for the precise distribution beyond that point.
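The running-total calculation can be sketched with Python's `itertools.accumulate`. The bin limits and frequencies below are invented for illustration.

```python
from itertools import accumulate

# Hypothetical bin upper limits (inches) and frequencies (illustrative only).
upper_limits = [61, 64, 67, 70, 73]
frequencies = [2, 5, 8, 4, 1]

# Each cumulative value is this bin's frequency plus the running total so far.
cumulative = list(accumulate(frequencies))

for limit, running_total in zip(upper_limits, cumulative):
    print(f"up to bin limit {limit}: {running_total}")
```

The last cumulative value should equal the total number of observations, which gives one more consistency check on the table.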

  • Limitations:

    • Cumulative distributions are not meaningful for categorical data with no natural order (e.g., types of drinks), because accumulating "up to" a value assumes the values can be ranked.

Conclusion

  • Recap of Frequency Distributions:

    • Apply concepts of absolute frequency, relative frequency, and percent frequency to analyze data.

    • Histograms visualize distribution effectively, revealing insights about shape and skewness.

    • Cumulative frequency sums observations up to each bin limit, showing how many observations fall at or below a given value.