Univariate Data Study Notes
Chapter 1: Understanding Univariate Data
1. Learning Intentions and Success Criteria
Learning Intention: Students will categorise data into the various types.
Success Criteria:
I can identify categorical (nominal and ordinal) and numerical (discrete and continuous) data.
2. Types of Data
Categorical Data
Categorical data consists of named categories or groups.
Types of Categorical Data:
Nominal: No natural order; the data can be divided into named sub-groups (e.g., fruits, types of transport).
Ordinal: Has a natural order; categories can be sorted from low to high (e.g., small, medium, large).
Numerical Data
Numerical data consists of numerical values that can be averaged or calculated with.
Types of Numerical Data:
Discrete: Counted data, such as the number of siblings, pets, or people attending; typically whole numbers (e.g., 1, 2, 3…).
Continuous: Measured data, such as length or height (cm, m, km) or weight (g, kg); values can include decimals, although they do not always.
3. Exam Example
2023 Exam 2:
Question 1a: Data was collected on electronic images to automate sizing of oysters for sale.
Variables in Study:
ID: Identity number of the oyster.
Weight: Weight of the oyster in grams (g).
Volume: Volume of the oyster in cubic centimeters (cm³).
Image Size: Size determined from its electronic image (in megapixels).
Size: Oyster size when offered for sale (small, medium, large).
Sample Data for 15 Oysters:
| ID | Weight (g) | Volume (cm³) | Image Size (megapixels) | Size |
|----|------------|---------------|-------------------------|--------|
| 1 | 12.9 | 13.0 | 5.1 | large |
| 2 | 11.4 | 11.7 | 4.8 | medium |
| 3 | 174 | 174 | 65 | large |
Part a: Categorical Variables
Task: Write down the number of categorical variables in Table 1.
Marks: 1 mark.
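A quick way to check an answer to this kind of question is to classify each variable and count. The sketch below is only an illustration: the type labels are an assumed reading of the variable descriptions above, not an official marking scheme.

```python
# Classification of each variable in the oyster table.
# These labels are an assumed reading of the variable descriptions,
# not an official marking scheme.
variable_types = {
    "ID": "categorical (nominal)",           # identity number: a label, not a quantity
    "Weight": "numerical (continuous)",      # measured in grams
    "Volume": "numerical (continuous)",      # measured in cubic centimeters
    "Image Size": "numerical (continuous)",  # measured in megapixels
    "Size": "categorical (ordinal)",         # small < medium < large
}

# Keep only the variables whose label starts with "categorical".
categorical = [name for name, kind in variable_types.items()
               if kind.startswith("categorical")]
print(categorical, len(categorical))
```

Note that a variable made of digits (such as ID) is not automatically numerical: it only names the oyster, so averaging it would be meaningless.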
4. Categorical Data Analysis
4.1 Frequency Tables
Creating Frequency Tables:
Determine if a percentage column is needed, or only frequency.
Use tally methods for accuracy.
Verify that the total frequency adds up correctly.
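The steps above can be sketched in Python. The sample data is hypothetical, and `Counter` simply stands in for a hand tally.

```python
from collections import Counter

# Hypothetical sample: the sale size recorded for ten oysters.
sizes = ["small", "medium", "large", "medium", "small",
         "medium", "large", "medium", "small", "medium"]

counts = Counter(sizes)          # the "tally" step
total = sum(counts.values())

# Build rows of (category, frequency, percentage).
table = [(cat, counts[cat], 100 * counts[cat] / total)
         for cat in ["small", "medium", "large"]]

for cat, freq, pct in table:
    print(f"{cat:<8}{freq:>4}{pct:>8.1f}%")
print(f"{'Total':<8}{total:>4}{100.0:>8.1f}%")

# Check: the frequencies must add up to the number of data values.
assert total == len(sizes)
```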
4.2 Bar Charts
Constructing Bar Charts:
Ensure bars do not touch each other or the y-axis.
Label axes clearly. Use a ruler for neatness.
Multiple Groups:
Bars may touch if displaying grouped comparisons. Use different colors or patterns, and include a key.
4.3 Segmented Percentage Bar Charts
Display Characteristics:
Percentage values displayed on the y-axis.
Sections stacked within a single bar. Ensure a key is present to indicate segment categories.
4.4 Interpreting Categorical Data
Calculating Mode:
The mode is the only type of average available for categorical data, because categories are not numbers and so a mean or median cannot be calculated.
Mode reflects the most common or frequently occurring category.
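As a minimal sketch, the mode of a categorical variable can be found with Python's standard library. The eye-colour list is hypothetical; `multimode` is used because it returns every category that ties for most frequent.

```python
from statistics import multimode

# Hypothetical eye colours recorded for seven students.
eye_colours = ["blue", "brown", "blue", "green", "brown", "blue", "hazel"]

# multimode returns a list, so ties (more than one mode) are not hidden.
modes = multimode(eye_colours)
print(modes)
```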
Analysis:
Describe and interpret data displays carefully, relying on actual data analysis rather than personal opinion.
Example Interpretation
The bar chart illustrates that blue is the most common eye color for students in Year 7 (34%) and Year 8 (28%).
5. Numerical Data Analysis
5.1 Learning Intentions and Success Criteria
Learning Intention: Students will analyze numerical data.
Success Criteria:
Create and interpret histograms (by hand and on CAS).
Analyze shape, skew, outliers, and spread.
5.2 Interpreting Numerical Data
Important aspects to mention:
Mode or modal category.
Median, mean (center).
Range or interquartile range (spread).
Frequencies or percentages.
Describe the shape/distribution/skew of the data.
5.3 Histograms
5.3.1 Discrete Data Histograms
Bars touch each other, and the frequency (y) axis starts at zero. Each bar represents a single numerical value, with its label centered under the bar.
5.3.2 Continuous Data Histograms
Constructed like discrete data histograms, but the data is grouped into intervals.
Aim for about 6 to 10 equal-sized groups.
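A rough sketch of grouping continuous data into equal-width intervals is shown below. The heights, the starting value of 150 and the width of 5 are all hypothetical choices made so that the data falls into 6 groups.

```python
# Hypothetical heights in cm.
heights = [150.2, 152.5, 155.0, 157.8, 158.1, 160.4, 161.9, 163.3,
           164.0, 165.7, 167.2, 168.8, 170.1, 172.6, 175.9, 179.4]

# Assumed grouping: 150-<155, 155-<160, ..., 175-<180 (6 equal groups).
start, width, n_groups = 150, 5, 6

frequencies = [0] * n_groups
for h in heights:
    index = int((h - start) // width)   # which group the value falls in
    frequencies[index] += 1

# Print a simple text histogram, one row per group.
for i, freq in enumerate(frequencies):
    low = start + i * width
    print(f"{low}-<{low + width}: {'#' * freq}  ({freq})")
```

A value on a boundary (such as 155.0) is placed in the group that starts at that value, so every data value belongs to exactly one group.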
5.4 Describing Numerical Distributions
5.4.1 Shape and Symmetry
Distributions can be:
Symmetrical: No skew.
Negative Skew: Data clusters at the higher end, with a tail towards the lower end.
Positive Skew: Data clusters at the lower end, with a tail towards the higher end.
5.4.2 Spread
Range: Measures variability; calculated as highest value - lowest value.
Interquartile Range (IQR): Spread of the middle 50% of the data; calculated via $IQR = Q3 - Q1$.
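Both measures can be sketched in a few lines. The data is hypothetical, and the quartiles are found as the medians of the lower and upper halves (leaving out the middle value when n is odd). This is one common convention; other software may use a slightly different quartile rule.

```python
from statistics import median

# Hypothetical data set, sorted into order first.
data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18])
n = len(data)

# Assumed convention: Q1 and Q3 are the medians of the lower and upper halves.
lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2 :]
q1, q3 = median(lower_half), median(upper_half)

data_range = data[-1] - data[0]   # highest value - lowest value
iqr = q3 - q1                     # IQR = Q3 - Q1

print(data_range, q1, q3, iqr)
```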
6. Measures of Center and Spread
6.1 Measures of Spread
6.1.1 Standard Deviation
Measures how far, on average, the data values deviate from the mean.
Low standard deviation: data clustered around the mean.
High standard deviation: data is more spread out.
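The contrast can be illustrated with two hypothetical data sets that share the same mean. The sketch assumes the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Two hypothetical data sets, both with a mean of 50.
clustered = [48, 49, 50, 51, 52]
spread_out = [10, 30, 50, 70, 90]

# stdev uses the sample formula (divides by n - 1).
sd_clustered = stdev(clustered)
sd_spread_out = stdev(spread_out)

print(mean(clustered), round(sd_clustered, 2))
print(mean(spread_out), round(sd_spread_out, 2))
```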
6.2 Measures of Center
6.2.1 Mean
Commonly referred to as the average; influenced by skew and outliers.
Calculation: sum of all values divided by number of values.
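The effect of an outlier on the mean can be seen in a small sketch. The values are hypothetical; the median is printed alongside only for comparison.

```python
from statistics import mean, median

# Hypothetical values, then the same values with one extreme value added.
values = [40, 42, 45, 47, 50]
with_outlier = values + [300]

# Mean = sum of all values divided by the number of values.
mean_before = sum(values) / len(values)
mean_after = sum(with_outlier) / len(with_outlier)

print(mean_before, median(values))
print(round(mean_after, 1), median(with_outlier))
```

One extreme value roughly doubles the mean here, while the median barely moves.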
6.2.2 Median
The middle value when data is listed in order.
Positions used to find median:
If n is even: Average the values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
If n is odd: Use the value at position $\frac{n+1}{2}$ directly.
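The position method can be sketched as a short function. For even n the two middle values sit at positions n/2 and n/2 + 1, counting from 1. The function name `median_by_position` and the sample lists are hypothetical.

```python
def median_by_position(values):
    """Find the median using 1-based positions in the ordered list."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2          # single middle position
        return ordered[position - 1]     # -1 because Python counts from 0
    lower_pos = n // 2                   # the two middle positions
    upper_pos = n // 2 + 1
    return (ordered[lower_pos - 1] + ordered[upper_pos - 1]) / 2

odd_example = median_by_position([7, 3, 9, 5, 1])   # ordered: 1 3 5 7 9
even_example = median_by_position([7, 3, 9, 5])     # ordered: 3 5 7 9
print(odd_example, even_example)
```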
7. Boxplots and Outliers
7.1 Five Number Summary
Components: Minimum, Q1, Median, Q3, Maximum.
7.2 Boxplot Creation
Represent data via five-number summary, creating visual displays that highlight the center and spread of distributions, along with any outliers.
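A minimal sketch of the numbers behind a boxplot follows. It assumes two things not stated above: quartiles are taken as the medians of the lower and upper halves, and outliers are flagged with the commonly used 1.5 × IQR fences. The data and the function name `five_number_summary` are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])          # median of the lower half
    q3 = median(data[(n + 1) // 2 :])    # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

values = [2, 4, 5, 6, 7, 8, 9, 10, 30]   # hypothetical data
minimum, q1, med, q3, maximum = five_number_summary(values)

# Assumed outlier rule: beyond 1.5 x IQR from the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print((minimum, q1, med, q3, maximum), (lower_fence, upper_fence), outliers)
```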
8. Normal Distribution
8.1 Characteristics
Normal distributions are approximately bell-shaped and symmetrical.
8.2 68-95-99.7% Rule
Approximately 68% of the data lies within one standard deviation of the mean, 95% within two, and 99.7% within three.
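The rule can be turned into intervals for any mean and standard deviation. In the sketch below, the mean of 170 and standard deviation of 8 are hypothetical, and `NormalDist` from the standard library is used only to show that the three percentages are rounded values.

```python
from statistics import NormalDist

def empirical_intervals(mean, sd):
    """Intervals covering roughly 68%, 95% and 99.7% of a normal distribution."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in [(1, 68), (2, 95), (3, 99.7)]}

# Hypothetical example: mean 170, standard deviation 8.
intervals = empirical_intervals(mean=170, sd=8)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% of values lie between {low} and {high}")

# Proportions within 1, 2 and 3 standard deviations for a standard normal.
z = NormalDist()
exact = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]
print([round(p, 4) for p in exact])
```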
9. Log Transformation and Histograms
9.1 Logarithm Basics
Convert between original values and logarithmic values.
9.2 Creating Histograms with Log Scales
Useful when the data spans a very large range of values, where a histogram on the original scale would bunch most values into one or two bars and be difficult to interpret.
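A short sketch of the conversion in both directions is shown below, assuming base-10 logarithms. The values are hypothetical and chosen to span a very large range.

```python
import math

# Hypothetical values spanning a very large range.
values = [1, 10, 100, 1000, 1000000]

# Original value -> log value (assumed base 10).
logs = [math.log10(v) for v in values]

# Log value -> original value: raise 10 to that power.
back = [10 ** x for x in logs]

print([round(x, 2) for x in logs])
print([round(b) for b in back])
```

On the log scale the values sit between 0 and 6, so a histogram of the logs can show them all with sensible bar widths.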
10. Conclusion
Understanding univariate data is crucial for effective statistical analysis, allowing for proper interpretation and visual representation of data sets across various contexts.