Chapter 1: Investigating Data Distributions


Investigating data distributions.


58 Terms

1

Types of Data: Categorical (Qualitative) Data

Variables that represent qualities or groupings. Data that cannot be measured numerically and is instead sorted into categories. Descriptive, interpretation-based.

2

Types of Data (under Categorical Data): Nominal Variables

Used to group individuals according to a particular characteristic (there is no clear order to the categories).

E.g., hair colour, nationality, eye colour.

Sorted into distinct, defined categories.

3

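Types of Data (under Categorical Data): Ordinal Variables

Data values that can be used to both group and order individuals according to a particular characteristic (there is an order to the categories).

An average cannot be meaningfully taken, even though the values may be written as numbers.

E.g., year level, finishing position (1st/2nd/3rd). Numeric labels such as postcodes or bank card numbers are also categorical, but they carry no meaningful order, so they are better treated as nominal.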

4

Types of Data: Numerical (Quantitative) Data

Variables that represent quantities and can be expressed in numerical values. Commonly used in statistical manipulation. Numbers-based, countable.

5

Types of Data (under Numerical Data): Discrete Variables

Countable: values come from a distinct set with gaps between them.

Usually whole numbers, but decimals or fractions are possible if only a specific set of values can occur (e.g., halves or quarters).

E.g., Number of _, total students in a class

6

Types of Data (under Numerical Data): Continuous Variables

Cannot be counted: can take any value within an interval.

They are infinitely divisible units collected by measuring (increasing level of accuracy)

Decimals, fractions, and whole numbers.

E.g. Height, weight, capacity, temperature, rates, time, etc.

7

Variables and data types

Depending on the wording of the question, some variables can be numerical or categorical.

Tip: If you take the average of the data, does it have any real meaning?

If yes, it is numerical, and if no, it is categorical.

8

Conversion of one type of data to another

Numerical:

A fisheries inspector records the lengths of 40 cod.

Categorical:

A fisheries inspector sorts the 40 cod into small, medium, and large size fish.
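
A minimal Python sketch of this conversion. The lengths and the cut-offs (under 40 cm is small, 40 cm to under 70 cm is medium, otherwise large) are assumptions made up for illustration; the card does not give any.

```python
# Hypothetical cod lengths in centimetres (illustrative values only).
lengths_cm = [32.5, 48.0, 71.2, 55.4, 39.9, 80.1]


def size_category(length_cm):
    """Convert a numerical length into an ordinal size category.

    The 40 cm and 70 cm cut-offs are assumptions for illustration.
    """
    if length_cm < 40:
        return "small"
    if length_cm < 70:
        return "medium"
    return "large"


# The numerical data becomes categorical (ordinal) data.
categories = [size_category(x) for x in lengths_cm]
print(categories)
```

Once the lengths are replaced by categories, taking an average no longer has any meaning, which matches the tip for telling numerical and categorical data apart.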

9

Univariate data vs Bivariate data

Univariate Data:

  • Analysing data on a single variable at a time

    • Explore each variable in data set separately.

    • Describe centre and spread

  • Examples:

    • Histograms

    • Stem-and-leaf plot

    • Box and whisker plot

    • Dot plot

    • Pie Charts

Bivariate Data:

  • Analyse two variables

  • Explore the relationship (association) between an independent and a dependent variable; an association on its own does not establish cause.

  • Examples:

    • Scatterplots

    • Stacked bar charts

    • Line Graph

10

Frequency Table: Definition

Lists the values of variables in a dataset and how often they occur.

  • Recorded as:

    1. Number Frequency: How often a value occurs.

    2. Percentage Frequency: Percentage of times a value occurs.

11

Frequency Table: Calculation

  • Frequency (Number) = How often a value occurs

  • Percentage Frequency (%) =
    (How often a value occurs / Total number of values) × 100%
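
The two calculations can be sketched in Python as follows. The eye-colour responses are hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical eye-colour responses (illustrative data only).
responses = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]

counts = Counter(responses)    # frequency: how often each value occurs
total = sum(counts.values())   # total number of values

# Percentage frequency = (frequency / total number of values) x 100
table = {value: (freq, freq / total * 100) for value, freq in counts.items()}

for value, (freq, pct) in table.items():
    print(f"{value:<6} {freq:>3} {pct:6.1f}%")
```

The percentage frequencies should always add to 100%, which is a quick check on a hand-built table.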

12

Bar Charts

Compares items either among categories or over time.

Components:

  1. Y-axis: Labelled as frequency or percentage frequency.

  2. X-axis: Labelled with the given variable.

  3. Bars:

    • Height represents frequency or percentage.

    • Drawn with gaps to show distinct categories.

    • One bar = one category.

  4. Intervals: Use equal increments.

  5. Title: Clearly describes the data.

13

Segmented Bar Chart

Variation of a bar chart showing composition over time.

Bars are stacked, with segments representing different categories.

Segment lengths reflect frequencies; total height gives overall frequency.

  • Include:

    1. Title

    2. Equal y-axis intervals

    3. Legend/Key to identify segments

    4. Colors/patterns for clarity

14

Percentage Segmented Bar Chart

A bar chart where segment lengths represent percentages.

Total bar height is always 100%.

  • Features:

    1. X-axis: Variable being compared.

    2. Y-axis: Percentage (%).

    3. Include Title.

    4. Use a key/legend to code and identify segments.

15

BIVARIATE Categorical data - Segmented bar chart

Used to chart two categorical variables. An extended version of single segmented bar chart.

16

Writing a Report Describing Categorical Data

  • Context: Summarise data collection and number of individuals.

  • Dominant category (mode): State frequency or percentage (or state if no mode exists).

  • Order & Importance: Rank categories and their relative importance.

  • Other Frequencies: 2nd highest and lowest frequency categories.

  • Recommendations: Suggestions and further analysis needed.

    Note: It is important to use descriptive words relating to the frequency and different categories.

17

Numerical Univariate Data - Grouped frequency table

  • Use "a - <b" notation* to create non-overlapping class intervals.

  • Typically, 5 to 15 intervals are used.

  • Example: 0 - <10, 10 - <20. Alternatively, state interval limits to the precision of the data, e.g., 30.0 - 34.9, 35.0 - 39.9.

  • Ensures data is grouped without overlap.

*The upper boundary is not included, ensuring no overlap between intervals.
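
A short Python sketch of grouping values into intervals where the lower boundary is included and the upper boundary is not. The data and the class width of 10 are assumptions chosen for illustration.

```python
# Hypothetical data; the class width of 10 is an assumption for illustration.
data = [3, 9, 10, 14, 19.9, 20, 27, 35, 39, 40]
width = 10

grouped = {}
for x in data:
    lower = int(x // width) * width        # lower boundary is included
    label = f"{lower} - <{lower + width}"  # upper boundary is excluded
    grouped[label] = grouped.get(label, 0) + 1

for label, freq in grouped.items():
    print(label, freq)
```

Because the upper boundary is excluded, a value of exactly 10 falls in 10 - <20 rather than 0 - <10, so no value can be counted twice.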

18

Numerical Univariate Data - Histogram

  • Y-axis: Frequency (count or percentage).

  • X-axis: Values of the variable, with each bar corresponding to a data interval.

  • Bars have no gaps between them (unless an interval has a frequency of zero).

  • The title must be included.

  • Shows the distribution of a single variable.

19

Numerical Univariate Data - Dot plots

  • Similar to a histogram but uses dots for each data point.

  • Suitable for discrete numerical data and small datasets (<20 data points).

  • The title must be included.

  • Ideal for visualising individual data points.

  • No y-axis, but dots must be evenly spaced to provide the same visual effect as a y-axis.

20

Numerical Univariate Data - Stem-and-leaf plots

  • Three types:

    1. Standard stem-and-leaf plot

    2. Split stem

    3. Back-to-back stem-and-leaf plot

  • Structure:

    • Stem: Leading digit(s)

    • Leaf: Trailing digit (always singular)

  • Key must always be included for clarity.

21

Standard Stem and Leaf Plot

  • Suitable for up to 50 data points.

  • Key: Must include units and help interpret the diagram.

  • Title: Must be included.

  • Aim to have between 5 and 10 class intervals.

  • Split stems: Data range (e.g., 0-4) uses one stem, and the next range (e.g., 5-9) uses a second stem.*

  • Asterisk (*): Indicates the second split stem.

  • Advantages:

    • Retains original data values.

    • Shows shape, outliers, centre, and spread of the distribution.

*e.g., 1 | 0, 1, 2, 3, 4 for data points 10-14 and 1* | 5, 6, 7, 8, 9 for data points 15-19.
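
A rough Python sketch of a split stem-and-leaf plot for two-digit whole numbers, using the asterisk convention described above. The data values and the function name `split_stem_plot` are hypothetical, and stems with no data are simply left out in this simplified version.

```python
# Hypothetical two-digit data (illustrative only).
data = [10, 12, 14, 15, 17, 19, 21, 23, 26, 28, 28]


def split_stem_plot(values):
    """Return rows of a split stem-and-leaf plot for two-digit whole numbers.

    Leaves 0-4 go on the plain stem; leaves 5-9 go on the starred stem.
    """
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leading digit, trailing digit
        label = f"{stem}" if leaf <= 4 else f"{stem}*"
        rows.setdefault(label, []).append(leaf)
    return rows


plot = split_stem_plot(data)
for label, leaves in plot.items():
    print(f"{label:>2} | {' '.join(map(str, leaves))}")
print("Key: 1 | 2 means 12")
```

Every original value can be read back from the rows, which is the main advantage of this display over a histogram.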

22

Summary – Univariate data Display

Table:

  • Categorical distribution: Frequency table

  • Numerical distribution: Grouped frequency table

Chart:

  • Categorical distribution: Bar chart; (percentage) segmented bar chart (max. 5 categories)

  • Numerical distribution: Histogram (N > 40); stem-and-leaf plot (N < 50); dot plot (N < 20); box plot (seen next year)

23

Numerical univariate Data – Histogram
Frequency Table vs Group Frequency Table (Discrete / Continuous Distribution)

Discrete Distribution:

Each value is labelled at the middle of its bar.

Continuous Distribution:

Values are labelled at the edges of each bar, showing that each bar covers a range.

24

Interpretation of Categorical Variable Distribution

Qualitative data that is classified, not quantified.

  • Descriptive Measures: numbers used to describe data sets.

  • E.g., Measures of central tendency: mode (any categorical data) and median (ordinal data only). A mean is not meaningful for categorical data.

  • Examples:

    • Tables: Frequency tables or percentage frequency tables.

    • Graphs: Bar charts or segmented bar charts.

25

Analysing Categorical Variable Distribution

  • State Total Frequency (Sample Size): Include the total number of observations and different categories/options.

  • Identify Modal Category: Mention if significantly larger than others.

  • Percentage Frequencies:

    • Provide for the modal category.

    • Optionally include others if relevant.

  • Focus on Key Categories: Avoid listing all when many exist.

    • Use Descriptive Terms: Clearly interpret trends or patterns.

Example: The sizes of 20 oysters were classified as “small”, “medium” or “large”. The most common size was medium, accounting for 50% of the oysters. A further 35% were small and 15% were large.

26

Interpreting Univariate Numerical Data - SOCS

In any analysis of univariate numerical data, you must make specific mention of 4 key features:

  • Shape

  • Outliers

  • Centre

  • Spread

27

Key features of a histogram - Shape

  • Negatively Skewed (-ve): tails to the left towards -ve direction, mean < median.

  • Positively Skewed (+ve): tails to the right towards +ve direction, mean > median.

28

Key Features of stem-and-leaf plot: Shape

Tip: Rotate your book 90° to determine where the tail of your data is moving towards!

29

Key features of a histogram - Outliers

  • Definition: Data values that deviate significantly from the main dataset (typically high/low).

  • Possible Causes:

    • Experimental error

    • Indication of novel/unique data

    • Extreme or "freak" values

  • Effects on Measures:

    • Not Affected: Mode, median

    • Significantly Affected: Mean, range

30

Outliers Test

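Step 1: Determine the Median (Q₂)
Position of Q₂ = (n + 1) / 2 in the ordered data

Step 2: Identify Q₁ and Q₃

  • Split the data into halves (below and above the median).

  • Use (n + 1) / 2 within each half, where n is now the number of values in that half, to find the positions of Q₁ and Q₃.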

Step 3: Calculate IQR
IQR = Q₃ - Q₁

Step 4: Perform Outlier Test

  • Lower Bound: Q₁ - 1.5 × IQR

  • Upper Bound: Q₃ + 1.5 × IQR

  • Any values outside these bounds are outliers.
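
The four steps can be sketched in Python as below. The data set and the helper name `quartiles` are hypothetical. The sketch uses the "median of each half" convention from Step 2 and leaves the median out of both halves when n is odd; a CAS or other software may use a slightly different quartile convention.

```python
from statistics import median


def quartiles(values):
    """Return Q1, Q2 (median) and Q3 using the 'median of each half' method.

    For an odd number of values the median itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Hypothetical data with one suspiciously large value.
data = [2, 4, 5, 5, 6, 7, 8, 9, 30]

q1, q2, q3 = quartiles(data)          # Steps 1 and 2
iqr = q3 - q1                         # Step 3
lower_bound = q1 - 1.5 * iqr          # Step 4
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(q1, q2, q3, iqr, lower_bound, upper_bound, outliers)
```

Here Q₁ = 4.5 and Q₃ = 8.5, so the bounds are -1.5 and 14.5 and only the value 30 is flagged.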

31

CAS: Outliers Test

Using the Statistics mode:

Enter values into "list 1"

Enter frequencies into "list 2"

Set the graph Type to "MedBox" with "Show Outliers" selected.

Press Set

Analysis → Trace (Show outlier)

32

Key features of a histogram - Centre

  • Median Class: The category containing the middle position.

  • Mean: Average; use only when data is symmetrical and outlier-free.

  • Modal Class: Category with the highest frequency;

    • Relevant only when one category significantly stands out.

    • Median class is usually more useful for describing the center.

33

CAS: Center

  • Set Variables:

    • List 1: Variable (x-axis), rename to description.

    • List 2: Frequency (y-axis).

  • Run Calculation:

    • Go to Calc → One-Variable.

    • Set Xlist to the data column.

    • Set Frequency:

      • 1 for single-column data.

      • Second column for grouped data.

    • Click OK.

  • Show Results:

    • Go to Calc → Display Stat.

34

Key features of a histogram - Spread

  • Range: Largest value - smallest value

    • Use when no outliers.

  • IQR: Q₃ - Q₁

    • Use when data is skewed or contains outliers.

  • minX: Smallest data point in data

  • maxX: Largest data point in data

35

Purpose and application of logarithmic scales to display data

  • Purpose:

    • Fit curves to non-linear relationships by applying a logarithmic function, making the data closer to a straight line.

    • Analyze large ranges of values in a compact form.

    • Respond to skewness in large data sets.

  • Applications:

    • Compress larger x-values by changing the scale to log₁₀(x).

    • Display data with wide ranges or exponential growth/decay.

    • Replace each x-value with its logarithm.

  • Equation: if log₁₀(x) = b, then 10ᵇ = x

    • Example: log₁₀(8) ≈ 0.9, since 10⁰.⁹ ≈ 8.

36

Properties of log₁₀(x)

  • If x > 1, then log₁₀(x) is positive.

  • If 0 < x < 1, then log₁₀(x) is negative.

  • If x ≤ 0, then log₁₀(x) is undefined.

  • If x = 1, then log₁₀(x) is zero.

37

CAS: Logarithm

  • Logarithm of 45:

    • log₁₀(45) ≈ 1.653

    • Use CAS: log(10, 45).

  • Find number for log = 2.7125:

    • log₁₀(x) = 2.7125, solve for x.

    • x ≈ 516.

    • Use CAS: solve(log(10, x) = 2.7125, x) or 10^2.7125
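
For readers without a CAS, the same two calculations can be checked with Python's standard `math` module. This is only an equivalent check, not the CAS procedure itself.

```python
import math

log_45 = math.log10(45)   # base-10 logarithm of 45
x = 10 ** 2.7125          # the number whose base-10 logarithm is 2.7125

print(round(log_45, 3))
print(round(x, 1))

# Undo the power to confirm the two operations are inverses.
print(math.isclose(math.log10(x), 2.7125))
```

Raising 10 to the power of the logarithm reverses `log10`, which is why solving log₁₀(x) = 2.7125 and evaluating 10^2.7125 give the same number.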

38

Constructing Histograms on CAS (Inc. Log)

  • Open Statistics mode.

  • Enter data into List 1.

  • Go to SetGraph:

    • Turn Draw to On.

    • Set Type to Histogram.

    • Choose Xlist as List 1.

    • Set Freq to 1.

  • Press Set.

  • Select the Graph button for the histogram.

  • In the Set Interval box:

    • Set Hstart to the given value.

    • Set Hstep to the given value.

Log scale (continue from the steps above):

  • In List 2, go to the last cell (Cal).

  • Type log(List1) in the calculation window.

  • Create a new Histogram:

    • Set Xlist to List 2.

    • Keep Freq as 1.

The histogram will display with bars starting at Hstart, increasing by Hstep per bar.
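
A text-only Python sketch of the same idea: take log₁₀ of every value, then count how many fall in each bar defined by a start and a step. The data and the interval settings (`h_start = 0`, `h_step = 1`) are assumptions for illustration and only mirror the Hstart and Hstep settings by name.

```python
import math

# Hypothetical positively skewed data spanning several orders of magnitude.
data = [3, 8, 12, 45, 90, 150, 800, 2500, 12000]

log_data = [math.log10(x) for x in data]  # plays the role of log(List1)

h_start, h_step = 0, 1  # assumed interval settings for the log scale
counts = {}
for value in log_data:
    # Left edge of the bar that contains this value.
    bar = h_start + math.floor((value - h_start) / h_step) * h_step
    counts[bar] = counts.get(bar, 0) + 1

for bar in sorted(counts):
    print(f"{bar} - <{bar + h_step}: {'#' * counts[bar]}")
```

On the original scale most values would be squeezed into the first bar; after the log transformation each bar covers one power of 10, so the spread of the data is easier to see.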

39

Five-Number Summary

  • Minimum Score: Lowest score.

  • Q1: 25% of data is below.

  • Median (Q2): Midpoint, 50% of data is below.

  • Q3: 75% of data is below.

  • Maximum Score: Highest score.

40

Median Properties and Use

  • Not affected by outliers.

  • Applied to ordinal, discrete, and continuous data.

  • Use when distribution is skewed or non-normal, or data is ordinal

  • Formula: Median position = (n + 1)/2 in the ordered data.

  • Odd Data: Middle value.

  • Even Data: Average of two middle values.

  • Example:

    • Data: 2, 9, 1, 8, 3, 5, 3, 8, 1 → Ordered: 1, 1, 2, 3, 3, 5, 8, 8, 9 → Median = 3.

    • Data: 10, 1, 3, 4, 8, 6, 10, 1, 2, 6 → Ordered: 1, 1, 2, 3, 4, 6, 6, 8, 10, 10 → Median = 5.
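
Both examples can be checked with Python's `statistics.median`, which sorts the data and averages the two middle values when n is even. The helper `median_position` is a hypothetical name added here to show the (n + 1)/2 position.

```python
from statistics import median

odd_data = [2, 9, 1, 8, 3, 5, 3, 8, 1]
even_data = [10, 1, 3, 4, 8, 6, 10, 1, 2, 6]


def median_position(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2


# Odd n: the position is a whole number, so the median is a data value.
print(median_position(len(odd_data)), median(odd_data))

# Even n: the position ends in .5, so the median is the average
# of the two values either side of it.
print(median_position(len(even_data)), median(even_data))
```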

41

Mean Calculation & Usage

  • Formula: x̄ = Σx / n

  • Use: Data with equal intervals, symmetric distribution, no outliers.

  • Sensitive to: Outliers and skewness.
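
A small Python illustration of that sensitivity, using made-up values: adding one extreme value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

data = [4, 5, 5, 6, 7]        # hypothetical, roughly symmetric values
with_outlier = data + [60]    # the same data plus one extreme value

print(mean(data), median(data))
print(mean(with_outlier), median(with_outlier))
```

The mean rises from 5.4 to 14.5, but the median only moves from 5 to 5.5.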

42

Mean vs. Median: Better measure of centre

  • Median: Better for skewed data or with outliers (based on order, not values).

  • Mean: Best for symmetric data with no outliers, gives average value.

43

Summary of Advantages/Disadvantages of measures of central tendency

Mode

  • Benefits: Quick and easy to compute; useful for nominal data.

  • Disadvantages: Poor sampling stability.

Median

  • Benefits: Not affected by extreme scores.

  • Disadvantages: Somewhat poor sampling stability.

Mean

  • Benefits: Sampling stability; related to variance; provides characteristic of distribution.

  • Disadvantages: Inappropriate for discrete data; affected by skewed data; less reliable when distribution is skewed or contains outliers.

44

Mode

  • Most common value in a data set.

  • Types: Unimodal, Bimodal, Trimodal, Multimodal.

  • Can be used for both qualitative and quantitative data.

  • Not affected by outliers, but may be close to extreme values, making it a weak measure of centre.

45

Range

  • Definition: Measure of the maximum spread.

  • Formula: Range = Largest Data Value - Smallest Data Value.

  • Note: Affected heavily by outliers, making it an unreliable measure of spread in skewed data.

46

Interquartile Range (IQR)

  • Definition: Measures the spread of the middle 50% of data values.

  • Note: Divides data into quarters.

  • Advantage: Generally not affected by outliers, making it more reliable than the range.

47

Standard Deviation (s)

  • Definition: Measures the spread of data around the mean (how far values are from the mean value).

  • Low SD: Data points are close to the mean.

  • High SD: Data is spread over a wider range of values.

  • Formula: s = √(∑(x - x̄)² / (n - 1))

    • x - x̄: Deviation

    • ∑(x - x̄)²: Sum of squared deviations

  • Note: Without squaring, the sum of deviations equals zero.

  • Sensitive to Outliers
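
The formula can be worked through step by step in Python and compared with `statistics.stdev`, which also divides by n - 1. The data values are hypothetical.

```python
import math
from statistics import stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

x_bar = sum(data) / len(data)               # mean
deviations = [x - x_bar for x in data]      # x - x̄ for each value
sum_sq = sum(d ** 2 for d in deviations)    # sum of squared deviations
s = math.sqrt(sum_sq / (len(data) - 1))     # sample standard deviation

print(sum(deviations))  # the raw deviations cancel out
print(sum_sq)
print(s, stdev(data))
```

The raw deviations sum to zero because values above and below the mean cancel, which is why they are squared before being added.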

48

CAS: Finding Values

knowt flashcard image
49

Choice of Spread

  • If the median (M) is used for centre: use the interquartile range (IQR) for spread.

  • If the mean (x̄) is used for centre: use the standard deviation (SD) for spread.

50

Box Plot

  • Median: Splits data into two equal sections (50% above, 50% below).

  • Quartiles: Split each 50% section into halves, creating 25% sections.

  • IQR: The box represents 50% of data, from Q1 to Q3.

  • Boxplot Components:

    • Number Line: Equal intervals.

    • Title: Represents the measured variable.

    • Quartiles: Vertical edges of the box.

    • Median: Vertical line inside the box.

    • Whiskers: Lines extending from the box.

    • Dots: Outliers.

51

Fences and Outliers + Relating Box Plot to Shape

  • Fences: Used to identify outliers in data.

  • Lower Fence: Q1 - 1.5 × IQR.

  • Upper Fence: Q3 + 1.5 × IQR.

  • Outliers: Data points outside the fences, denoted by open circles.
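
As a rough sketch of the fence formulas above, the following Python computes both fences and lists the outliers. The function names and the data set are hypothetical; Q1 = 4.5 and Q3 = 12.5 are the quartiles of this sample under the median-of-halves convention.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Return the data points that lie outside the fences."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample data with Q1 = 4.5 and Q3 = 12.5 (IQR = 8)
data = [2, 4, 5, 7, 8, 10, 12, 13, 40]
print(fences(4.5, 12.5))               # (-7.5, 24.5)
print(find_outliers(data, 4.5, 12.5))  # [40]
```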

52
New cards

Steps for Skewness in Box Plots

  1. Ignore outliers.

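  2. Measure the distance from the minimum value to the median (the lower distance).

  3. Measure the distance from the median to the maximum value (the upper distance).

  4. Use the formula: (upper distance - lower distance) / lower distance.

  5. Compare the two distances. If the upper distance is larger, the distribution is positively skewed; if the lower distance is larger, it is negatively skewed; if they are almost the same, it is approximately symmetric.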
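
The steps above can be sketched in Python as follows. This reads the formula as (upper distance - lower distance) / lower distance, where the upper distance is median → maximum and the lower distance is minimum → median. The function name `boxplot_skew` and the `tolerance` cut-off for "almost the same" are assumptions for illustration; the card does not give a threshold.

```python
def boxplot_skew(minimum, median, maximum, tolerance=0.1):
    """Classify skewness from box plot values (outliers already ignored).

    tolerance is a hypothetical cut-off for "almost the same".
    """
    lower = median - minimum   # minimum -> median distance
    upper = maximum - median   # median -> maximum distance
    if lower == 0:
        raise ValueError("minimum equals median; the ratio is undefined")
    ratio = (upper - lower) / lower
    if ratio > tolerance:
        return "positively skewed"
    if ratio < -tolerance:
        return "negatively skewed"
    return "approximately symmetric"


print(boxplot_skew(10, 20, 50))  # positively skewed
print(boxplot_skew(10, 40, 50))  # negatively skewed
print(boxplot_skew(10, 30, 50))  # approximately symmetric
```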

53
New cards

Boxplot Analysis: SOCS

  • Shape: The form the data takes when graphed (e.g., skewness).

  • Outliers: Extreme values.

  • Centre: The median, used as the measure of centre because it is resistant to outliers.

  • Spread: IQR and range.

Sample Prompt for a single box plot: The distribution is (state shape) and (state whether there are outliers present). The distribution is centred at (state the median), the median value. The spread of the distribution, as measured by the IQR, is (state IQR) and, as measured by the range, is (state range).

Note: If the question states units, include them.

  For example: Age (years), time (hours), pulse rate (beats per minute).

54
New cards

Normal Distribution

  • Definition: Data is distributed symmetrically in a bell-shaped curve around the mean.

  • Properties:

    • 50% of the data lies above the mean and 50% lies below it.

    • Symmetric about the mean.

    • Almost all of the data lies within about 3 standard deviations either side of the mean.

    • Shape and size determined by mean and standard deviation.

  • Example: Height, blood pressure, measurement errors.

  • Type: Continuous data.

55
New cards

Mean in Normal Distribution

  • Position: Center of the bell curve.

  • Characteristics:

    • Maximum density of observations (highest point of the curve).

    • Mean = Median = Mode.

    • Shifting the mean moves the curve left or right.

  • Behavior: Data clusters around the mean in a normal distribution.

56
New cards

Empirical Rule (68-95-99.7 Rule)

  • About 68% of observations are within 1 standard deviation of the mean (x̄ ± s).

  • About 95% of observations are within 2 standard deviations of the mean (x̄ ± 2s).

  • About 99.7% of observations are within 3 standard deviations of the mean (x̄ ± 3s).
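
To make the rule concrete, here is a small Python sketch that turns a mean and standard deviation into the three intervals. The function name and the example values (mean 170, SD 10) are hypothetical, and the percentages only apply when the data is approximately normal.

```python
def empirical_intervals(mean, sd):
    """Return {percentage: (low, high)} for 1, 2 and 3 standard deviations."""
    return {
        pct: (mean - k * sd, mean + k * sd)
        for k, pct in ((1, 68), (2, 95), (3, 99.7))
    }


# Hypothetical example: mean = 170, standard deviation = 10
for pct, (low, high) in empirical_intervals(170, 10).items():
    print(f"About {pct}% of observations lie between {low} and {high}")
# About 68% of observations lie between 160 and 180
# About 95% of observations lie between 150 and 190
# About 99.7% of observations lie between 140 and 200
```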

57
New cards

Z-Score and Standardization

  • Definition: A z-score represents how many standard deviations a data point is from the mean.

  • Formula:
    z = (x - x̄) / s
    Where:

    • x = data value

    • x̄ = mean

    • s = standard deviation

  • Interpretation:

    • Positive z-score: Data point is above the mean.

    • Negative z-score: Data point is below the mean.

    • Z-score of 0: Data point is at the mean.

  • Uses:

    • Standardize data across distributions.

    • Compare relative positions of data points.

    • Calculate the area under the curve for a given z-score.

  • Examples:

    • Z = 2: 2 standard deviations above the mean.

    • Z = -3: 3 standard deviations below the mean.
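
As an illustration of using z-scores to compare relative positions, the sketch below standardizes two test marks from different distributions. The function name and all of the marks, means and standard deviations are made up for the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Hypothetical marks: the same student in two different subjects
maths = z_score(75, mean=65, sd=5)       # 2.0
english = z_score(80, mean=70, sd=10)    # 1.0

print(maths, english)  # 2.0 1.0
# The raw English mark is higher, but the maths mark is further above
# its class mean, so it is the stronger result relative to the class.
```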

58
New cards

Standard Z-Scores to Actual Values

  • Formula:
    x = (z * s) + x̄
    Where:

    • x = actual data value

    • z = z-score

    • s = standard deviation

    • x̄ = mean

  • Interpretation:
    To find the actual score (x) from a standard score (z), multiply the z-score by the standard deviation (s) and then add the mean (x̄).

  • Example:
    If the mean is 50 and the standard deviation is 5, and the z-score is 2, the actual score is:
    x = (2 * 5) + 50 = 60
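
The same conversion can be written as a short Python sketch. The function name `actual_value` is hypothetical; the numbers are the ones from the example above.

```python
def actual_value(z, mean, sd):
    """Convert a standard score z back to an actual data value."""
    return z * sd + mean


x = actual_value(2, mean=50, sd=5)
print(x)  # 60

# Standardizing the result again recovers the original z-score
print((x - 50) / 5)  # 2.0
```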