kine-2050-notes

Topic 1 - Introduction to Statistics

  • Statistics: A set of procedures and principles for collecting, organising, analysing, and interpreting data to help people make decisions under uncertainty.

  • H.G. Wells (1903): Hypothesised that statistical thinking would become as crucial as literacy for good citizenship.

  • Importance of Studying Statistics:

    • Necessary for evaluating scientific evidence in any scientific discipline.

    • Enables responsible evaluation of statistical arguments used to influence behaviour.

  • What is Statistics?

    • The science of collection, organisation, analysis, and interpretation of data.

  • Big Picture Goal: To infer conclusions about a population using data from a sample.

  • Purpose of Data: To get necessary information and knowledge.

Case Study: How Nimble Are Your Fingers?

  • Manual Dexterity Test: Measures how many small pieces one can assemble in one minute.

  • Summaries: Mean and other measures of central tendency, variance measures, min/max scores, and the number of participants.

  • Moral: Simple data summaries can tell interesting stories and make data easier to understand. Data is used to judge and make decisions.

The Discovery of Knowledge

  1. Asking the right question(s).

  2. Collecting useful data including deciding how much is needed.

  3. Summarising and analysing data to answer the question(s).

  4. Making decisions and generalisations based on the observed data.

  5. Turning the data and subsequent decisions into new knowledge.

Two Types of Statistics

  1. Descriptive Statistics

    • Organise, describe, and summarise a small dataset.

    • Results represent the entire dataset.

    • The sampled group is the group of interest.

    • Can be the first stage of analysis.

    • Often used when researchers begin a new area of investigation.

  2. Inferential Statistics

    • Conclusions about populations are derived from small, random samples.

    • The sampled group is a sample of the group of interest.

    • Use smaller datasets to make estimates and draw conclusions about a larger population from which the sample is drawn.

    • Can determine cause-and-effect relationships, test hypotheses, and make predictions.

Population and Sample

  • Population: All objects that researchers want to describe or make inferences about.

    • Characteristic of a population = “parameter”; the population size is denoted N

  • Sample: A sub-group of a population that the researcher believes represents the population.

    • A group of a specific size (n) is selected and measured.

    • Characteristic of a sample = “statistic”

  • Random Samples: Best for experiments because they are unbiased.

    • Every element of the population must be equally likely to be selected for the sample group.

    • The selection of one element does not affect the possibility of other elements being selected.
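A minimal sketch of drawing a simple random sample, assuming the population can be listed as participant IDs. The function name `simple_random_sample`, the ID list, and the seed are illustrative assumptions, not part of the course material.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every element is equally likely to be chosen."""
    rng = random.Random(seed)  # the seed only makes the demo reproducible
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical population of 100 participant IDs
population = list(range(1, 101))
sample = simple_random_sample(population, n=10, seed=2050)
print(sample)
```

Here N = 100 is the population size and n = 10 is the sample size; each ID can appear in the sample at most once.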

Variables

  • Characteristics of a person, object, or phenomenon that are amenable to change and measurable.

  • Any observable/measurable property of organisms, objects, or events.

Types of Variables

  1. Quantitative (Numerical)

  2. Qualitative (Categorical)

Quantitative Variables
  • Numerical data that can be added, subtracted, multiplied, and divided.

    • Examples:

      • Age (years), Blood pressure (mm of Hg)

      • BMI, Pulse per minute

      • Exercise in hours per week, Coffee drinking in ounces per day

Continuous vs. Discrete
  • Continuous: Can theoretically take on any value within a given range (e.g., height = 188.99955 cm).

  • Discrete: Can only take on certain values (e.g., number of children in a family, number of cities).

Qualitative Variables
  • Binary: Two categories

    • Examples:

      • Dead/alive

      • Treatment/placebo

      • Disease/no disease

      • Exposed/Unexposed

      • Heads/Tails

      • Did you have breakfast in the morning? (Yes/No)

  • More than Two categories

    • Example:

      • Hair colour – Blonde, Red-haired, Brown, and so forth.

Classification of Data or Levels of Measurement

  • Four levels of measurement:

    • Qualitative

      • Nominal

      • Ordinal

    • Quantitative

      • Interval

      • Ratio

A) Nominal

  • Data placed in categories (no ordering).

  • Cannot be quantified.

  • Mutually exclusive.

  • e.g., blood type, type of car owned, gender, colour of paint.

B) Ordinal

  • Data is ranked.

  • e.g., “Idol” contest, preference (first, second, third), mineral hardness, cancer stages, University ranking, Letter-grades.

C) Interval

  • Equal units of measurement assigned to the attribute.

  • Zero point is arbitrary!

  • Therefore, not proportional (or multiplicable).

  • e.g., temperature (F, C): temperature can be below 0 degree Celsius (-10 or -20).

D) Ratio

  • Same as interval but zero is absolute or true.

  • Zero indicates an absence of the variable.

  • Therefore, direct comparison can be made.

  • e.g., Age, distance, weight, time, money, etc.

Dependent and Independent Variables

  • Dependent Variable (DV)

    • The variable of primary interest; i.e., it is measured.

    • A variable whose changes we wish to study; a response variable.

    • The variable designed to measure the effect of the variation of the independent variable.

  • Independent Variable (IV)

    • A variable we believe affects the measurements obtained on the dependent variable; i.e., it is manipulated.

    • A variable whose effects on the dependent variable we wish to study.

    • The variable that the researcher changes within a defined range, to study the effect on the dependent variable.

Variables: Vitamin C study

  • The independent variable, daily vitamin C intake, is hypothesised to affect the dependent variable, life span.

  • Scientists will manipulate the vitamin C intake in a group of 100 people: 50 people will be given a daily high dose of vitamin C and 50 people will be given a placebo pill over a period of 25 years. The goal is to see whether the independent variable (high vitamin C dosage) affects the people's life span.

Experimental Control

  • “Ideal”: To imply causation, the experimenter eliminates the influence of all variables that could affect the DV except the one(s) directly manipulated.

  • All conditions are kept the same for all participants except for the effect of the IV.

  • “Reality”: Impossible to control all variables that could affect the DV.

  • Researchers control the variables they can.

  • Other influences that are not controlled are assumed to be randomized. (i.e., we assume the effects are “washed out” if they are “spread out” over the groups).

Statistical Methods

  • The researcher's tools:

    1. To assist in describing data

    2. In making inferences or generalisations from experimental data (sample) to larger groups (population)

    3. In studying causal relationships

Topic 2: Organising and Displaying Data

  • Good research is based on collecting large amounts of data, which needs to be simplified.

  • Frequency distribution: Lists all possible data values or types and the frequency of occurrence of each one.

    • Meant to organise and describe the data in table form.

    • Use the frequency table to construct a frequency histogram (graph).

    • Reveal the pattern of the scores/observations.

Types of Frequency Distributions

  • a) Ungrouped: does not need grouping (e.g., blood type)

    • Frequency of all the possible data values or items in your dataset.

    • Can be nominal/ordinal categories OR quantitative but small numbers of single values.

    • The categories are already grouped

  • b) Grouped (class intervals): bundled into chunks

    • Applies when all “possible data values” would be too many, so data is arranged and separated into groups called class intervals.

    • Each class interval includes a range of data values

Steps in Constructing a Frequency Distribution

  • Seven steps in constructing a Grouped frequency distribution.

    • Step 1: Count the number of scores

    • Step 2: Identify highest and lowest score (range)

    • Step 3: Identify the smallest unit of measurement

      • What is the smallest possible division? (i.e., by how much can your score increase?)

    • Step 4: Decide on the appropriate number of class intervals

    • Step 5: Decide on the score range of each class interval

    • Step 6: Round this class interval to make the range PRETTY. An i (class interval) of 16.57 or 14.5 would be an ugly, clunky range!

      • Instead, 20, 15, 12 or 10 would be prettier. So try a couple of them, starting with a class interval of 15, and maybe go smaller afterward.

    • Step 7: List the class intervals of scores in order. Make sure that the intervals have:

      1. Same width (range of numbers)

      2. no overlap across intervals

      3. no gaps
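The steps above can be sketched in code. This is only an illustration under stated assumptions: the scores are hypothetical whole numbers, the class width of 10 and starting value of 50 are chosen by hand as "pretty" values, and `grouped_frequency` is a made-up helper name.

```python
def grouped_frequency(scores, width, start=None):
    """Tally scores into equal-width class intervals [lower, lower + width)."""
    if start is None:
        start = min(scores)
    table = []
    lower = start
    while lower <= max(scores):
        upper = lower + width
        # Half-open intervals guarantee no overlap and no gaps
        count = sum(1 for s in scores if lower <= s < upper)
        table.append((lower, upper, count))
        lower = upper
    return table

# Hypothetical exam scores (not from the notes)
scores = [52, 67, 71, 58, 90, 84, 77, 63, 69, 95, 88, 73]
table = grouped_frequency(scores, width=10, start=50)
for lower, upper, count in table:
    # Labels assume the smallest unit of measurement is 1 (e.g. 50-59)
    print(f"{lower}-{upper - 1}: {count}")
```

Every interval has the same width, and the counts add up to the number of scores, so all data are covered.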

Ungrouped distributions

  • Use UNGROUPED distributions (for data that are already in categories, e.g., nominal data such as blood type)

    1. when data are items rather than numbers, i.e., nominal or ordinal (qualitative) values

    2. when all possible data values can be listed without being too many (< 15), e.g., a small number of possible discrete scores (such as how many courses each student in this class is enrolled in this term)

  • In this case, the number of rows in the frequency table is clear, so the steps are not needed: just list the items or values and start tallying
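For ungrouped data the tally can be done directly. A small sketch using hypothetical blood-type data (the values below are invented for illustration):

```python
from collections import Counter

# Hypothetical blood types for 12 people
blood_types = ["O", "A", "A", "B", "O", "AB", "O", "A", "O", "B", "A", "O"]

frequency = Counter(blood_types)  # one row per category, no class intervals needed
for category, count in frequency.most_common():
    print(f"{category}: {count}")
```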

Grouped distributions

  • Use GROUPED distributions when data values are continuous (e.g., weight, time, blood pressure) or when there are too many possible data values (e.g., age or salary)

    • Thus, need a range of values known as the class interval (i) or bin; calculated in step 5

  • A good start is to first estimate what would be a good number of bins (step 4), but you may need to redo steps 4 and 5 to get a “PRETTY” class interval or bin, e.g., 2, 5, 10, 12, 15, 20 etc. (or 0.01, 0.2, 0.5 etc.)

  • To correctly GROUP data

    • After calculating the class interval (range), add this i to the start of each bin, starting with the smallest score or value (or round down to have a pretty lower number) in the dataset

    • These bins should have (1) the same width/range (2) no overlap across bins (3) no gaps (unit change grams e.g.) (4) cover all the data in that set.

  • NOTE: Sometimes you will want to set aside extreme scores while doing steps 2-6 to determine the class intervals. But after determining the class intervals, you need to add these extreme scores back into the tally. All scores must be included in your distribution.

Graphs

  • Histograms only represent frequencies (of grouped data)

    • A pictorial representation of a frequency distribution or other data

    • Helpful in understanding concepts e.g., frequencies, and other summary data (next topics)

  • Two types of graphs used to plot frequency

    1. Bar graphs for Ungrouped data

    2. Histograms for Grouped data

Bar Graphs

  • Depict frequencies or other group statistics (e.g., those described in the next topic) as a vertical bar for each group or category

  • The groups or categories on the x-axis are nominal … so the order of the categories doesn’t matter

  • Separated by some space.

  • Bar graphs with no space between the vertical bars are histograms, which depict interval/ratio data

    • Complex bar graphs (more info)

Histogram

  • A histogram uses vertical bars to depict the frequencies of an interval/ratio variable (but not other statistics).

  • A histogram differs from a bar graph in that it does not have spaces between the bars

  • Ungrouped FD is depicted by a bar graph while Grouped FD is depicted by a histogram

TOPIC 3: Measure of Central Tendency

  • A. Mean

  • B. Median (mdn)

  • C. Mode

A. Mean (Arithmetic average)

  • Calculate by hand from raw scores: Mean = \Sigma x / n (the sum of the scores divided by the number of scores)

    • e.g. 1.42 + 1.97 + 1.42 + 1.50 + 1.67 = 7.98; Mean = 7.98/5 = 1.60

    • Sensitive to extreme scores

  • OR can calculate by computer (or statistics programs like Excel)

  • The mean of a sample of X scores is symbolised as \bar{X} (X-bar) ← statistic

  • The mean of a population is symbolised by the Greek letter \mu (mu) ← parameter

B. Median (middle score)

  • The median indicates the midpoint of the given observations: it divides the ordered values into 2 equal parts. To locate its position, order the values, count how many you have, add 1, and divide by 2 (e.g., with 5 values the median is the 3rd ordered value).

C. Mode

  • The value(s) that occur most often (the most common value in the dataset)

    • e.g. 1, 7, 5, 9, 8, 7: Mode = 7

    • e.g. 1, 7, 5, 9, 8: No mode

    • e.g. 1, 7, 7, 5, 9, 9, 8: Bimodal (7 and 9)

    • However, mode is ONLY useful for categorical data, or possibly very large datasets of interval/ratio data.
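The three measures can be checked with Python's standard `statistics` module. This is a sketch, not the course's prescribed method; the helper `modes` is a hypothetical name, written so that a dataset where every value occurs once is reported as having no mode, matching the convention in these notes.

```python
from collections import Counter
from statistics import mean, median

def modes(data):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: "no mode"
    return sorted(value for value, count in counts.items() if count == top)

times = [1.42, 1.97, 1.42, 1.50, 1.67]   # scores from the mean example
print(round(mean(times), 2))             # sum of scores / number of scores
print(median(times))                     # middle value of the ordered scores
print(modes([1, 7, 5, 9, 8, 7]))         # one mode
print(modes([1, 7, 5, 9, 8]))            # no mode
print(modes([1, 7, 7, 5, 9, 9, 8]))      # bimodal
```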

Which measure to use?

  1. Nominal – each item has its own category, so there is no need to group it. USE MODE (can’t calculate mean or median), e.g. Law = 64; Kine = 59; Eng. = 37

  2. Ordinal – there's an order to the categories. USE MEDIAN, e.g. 1st, 2nd, 3rd, 4th, 5th

  3. Interval or Ratio – USE Mean and/or Median

    • Use MEDIAN instead of mean if highly skewed distribution or if have outliers (median not as affected).

Excel: Calculating central tendency measures

  • Formula in Excel

    • Mean = average(data_set)

    • Median = median(data_set)

    • Mode = mode(data_set)

  • Or, given the Toolpak, run Descriptive statistics under Data Analysis

  • Graphs for summary data

    • The distribution of data, including central tendencies like mean and medians, can also be graphically depicted

Bar Graphs & Histograms

  • Bar graphs and Histograms can be used to plot means (or medians) of the dependent variable (DV) along the y-axis across

    • Bar graphs: x-axis represents different groups or categories (nominal); and have space between bars

    • Histograms: x-axis plots an independent variable that is interval/ratio

Key characteristics of Graphs

  • The distribution of your sample data (for interval and ratio units of measure)

    1. Peaks

    2. Spread

    3. Kurtosis (leptokurtic, platykurtic)

    4. Symmetry/Non Symmetrical distributions (POSITIVE & NEGATIVE)

1) PEAKS
  • The tallest cluster/s of bars:

  • Represent the most common values/bulk of data: the mode

2) Spread
  • How much does the data vary?

3) Kurtosis
  • The relative peaked-ness or flatness of the distribution.

  • It reflects whether the scores are more or less evenly distributed throughout the measurement range.

Leptokurtic
  • The scores are bunched together with steeply sloping sides.

Platykurtic
  • The scores are more evenly spread out:

  • A greater proportion of the scores fall toward the ends, or tails.

5) Symmetry/ Normal distribution
  • A distribution is termed Symmetrical when the data frequencies decrease at equal rates above and below a central point.

  • Visually Bisected (One half is mirror image of the other): Mean = Median = Mode

6) Non-Symmetrical distributions Skewed (positive or negative)
  • bunching of the observations at one or the other end of the measurement range

Skewness of data

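  • To check whether your data are skewed, compute 3 \times (mean - median)/SD; a positive value indicates positive skew and a negative value indicates negative skew

    • Magnitude below 1: not skewed; 1 to 3: skewed; above 3: extremely skewed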
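A hedged sketch of the skewness check, assuming the intended formula is Pearson's second skewness coefficient, 3 × (mean − median) / SD, and assuming the sample SD is used. The data and the function name `pearson_skew` are hypothetical.

```python
from statistics import mean, median, stdev

def pearson_skew(data):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD (sample SD assumed)."""
    return 3 * (mean(data) - median(data)) / stdev(data)

# Hypothetical positively skewed data: one large value pulls the mean above the median
incomes = [30, 32, 35, 36, 38, 40, 41, 45, 120]
print(round(pearson_skew(incomes), 2))

# Symmetric data: mean equals median, so the coefficient is 0
print(pearson_skew([1, 2, 3, 4, 5]))
```

With this sign convention a positive result means the mean has been pulled toward a longer upper tail.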

Bimodal data

  • Why it’s not sensible to rely only on summary statistics like central tendency. SHOULD ALWAYS PLOT!

  • Distributions with outliers

    • Outliers, like skewed distributions, also affect the mean but not the median. Mean will “lean” toward the extreme – outliers or longer tail.

Examples of distribution

  • Examples of Normal distribution e.g. height, birth weight, errors in measurements or distance from bull’s eye, blood pressure, RT, MT, marks on a test (if not too easy or hard), IQ, GPA, shoe size, hours slept

  • Examples of Skewed data

    • Some variables are more likely to be positively skewed: income, scores on a difficult test, numbers of kids or cars or broken bones etc, points scored in a game, variables with a lower limit like weight.

    • Some variables are more likely to be negatively skewed: age at death/lifespan, scores on an easy test, income as a function of age, variables with an upper limit (100%).

    • A few variables are bimodal, which usually reflects a combination of two distributions: peak restaurant hours, book prices, or height when both men and women are included. Bimodal means two peaks/modes, but the peaks don’t need to be equal.

TOPIC 4: MEASURE OF VARIABILITY

  • Measure of Variability

    • A measure of variability is a single number that describes the spread in a set of data.

    • How much do the scores vary from one another?

  • Most common measures:

    1. Range

    2. Standard deviation (SD) = \sqrt{variance}

1. Range

  • Total spread in data

  • Range = highest score – lowest score

  • E.g. 4, 5, 7, 9 Range = 9 – 4 = 5

  • Interpretation – score range over which 100% of scores fall.

  • Advantage – very quick. – can be used for all levels of measurement

  • Disadvantage – influenced by single extreme scores

Average Deviation

  • Notation: \Sigma X = add up all the X scores

  • How much does each score deviate (vary) from the mean?

  • E.g. 14, 12, 9, 17, 8 N=5 Mean = 60 / 5 = 12

  • How much each score is from the mean?

    • 14 – 12 = 2

    • 12 – 12 = 0

    • 9 – 12 = – 3

    • 17 – 12 = 5

    • 8 – 12 = – 4

  • Note 1: Add up the deviations = 0. ALWAYS!

  • Note 2: Use absolute values (i.e. ignore the negative signs): 2 + 0 + 3 + 5 + 4 = 14. Now divide by N, e.g. 14 / 5 = 2.8

  • Simple but never used.

  • Is there another method to remove the negative sign?
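The average-deviation arithmetic above can be reproduced in a few lines. The last line previews one answer to the question of removing the negative sign: squaring each deviation. Variable names are illustrative.

```python
scores = [14, 12, 9, 17, 8]
n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]                   # each score minus the mean
sum_of_deviations = sum(deviations)                       # deviations cancel out
average_deviation = sum(abs(d) for d in deviations) / n   # ignore the signs, divide by N
sum_of_squares = sum(d ** 2 for d in deviations)          # squaring also removes the signs

print(deviations, sum_of_deviations, average_deviation, sum_of_squares)
```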

Variance (\sigma^2, Greek letter sigma squared)

  • Sum of Squares

    • The sum of the squared deviations from the mean, \sum (X - \mu)^2

    • Always a positive value

1. Variance
  • The Sum of Squares divided by the number of scores (N) is statistically known as the variance

2. Standard Deviation
  • Measure of variability for scores about the mean.

  • Measure of the deviations of all the scores from the mean, expressed as a single number. Note: the sample equation for SD is similar to, but slightly different from, the population equation given earlier, which better ILLUSTRATES what SD is. The sample version is the one you need to know for this course.

  • Interpretation

    • SD – a method of specifying the % of scores falling within certain score limits around the mean.

      • e.g. What does Mean = 14 ± 2 imply? 68% of scores fall within the score range of 12–16

      • e.g. What if Mean = 14 ± 1? 68% of scores fall between 13–15

      • e.g. What if Mean = 14 ± 3? 68% of scores fall between 11–17

  • Quick but crude estimate: SD = range / 4
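A sketch comparing the population and sample versions of variance and SD, using Python's `statistics` module. It assumes the sample equation referred to in these notes is the usual one that divides the sum of squares by n − 1; the notes themselves do not show it, so treat that as an assumption.

```python
from statistics import pvariance, pstdev, variance, stdev

scores = [14, 12, 9, 17, 8]   # same scores as the average-deviation example

pop_var = pvariance(scores)   # sum of squares / N
pop_sd = pstdev(scores)       # square root of the population variance
samp_var = variance(scores)   # sum of squares / (n - 1), assumed sample formula
samp_sd = stdev(scores)       # square root of the sample variance
crude_sd = (max(scores) - min(scores)) / 4   # quick "range / 4" estimate

print(pop_var, round(pop_sd, 2), samp_var, round(samp_sd, 2), crude_sd)
```

The sample values are slightly larger than the population values because the same sum of squares (54) is divided by 4 instead of 5.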

Mean vs. Median

  • With extreme scores, the mean is not a good measure of central tendency. How much skewing before using median instead of mean?

  • Rule of Thumb: if the mean and median differ by 1 SD or more, then use the median

TOPIC 5 PERCENTILES AND Z SCORES

Standard Normal Distribution

  • Interquartile Range (IQR)

    • IQR is a measure of variability, based on dividing a dataset into quartiles

    • Quartiles divide a rank-ordered data set into four equal parts

    • The values that divide each part are called the first, second, and third quartiles; and

    • they are denoted by Q1, Q2,and Q3, respectively
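A brief sketch of computing quartiles and the IQR. Textbooks and software use several slightly different quartile conventions; the code below uses Python's default ("exclusive") method, which is an assumption and may not match the method used in this course. The data are the exam marks used in the percentile example later in these notes.

```python
from statistics import quantiles

scores = [2, 9, 6, 5, 16, 15, 10, 8, 7, 4, 1]

# quantiles() sorts the data itself and returns the three cut points
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1   # spread of the middle 50% of the scores

print(q1, q2, q3, iqr)
```

Q2 is the median, so it should agree with the middle score of the ordered data.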

Percentiles

  • Calculate the percentile of a raw score from ungrouped data

  • Calculate the percentile of a raw score from grouped data

  • Calculate the raw score of a given percentile

Relative Scores: Percentiles and Z-Scores

  • A method of describing the standing of an individual in relation to a group.

  • Common use with norms e.g., height/weight tables, age, fitness level, exams.

  • Achieved by translating an individual’s raw score into either percentile or z-score (transformation)

  • Raw scores refer to the original measure e.g., height, % on exam

    • transform to letter grade.

Percentiles

  • Percentile score = the PERCENTAGE of people in the group who have the same raw score or a lower raw score, than the one in question.

  • Note: 50th percentile = median

  • e.g. Exam marks out of 20. Your score = 10

    • Express the raw score of 10 as a percentile

    • Scores (n = 11): 2, 9, 6, 5, 16, 15, 10, 8, 7, 4, 1

    • What % of total scores (n = 11) fall at or below 10?

      1. Order the scores

      2. Locate the score

      3. Calculate the percentile = (ordinal rank of the given value / number of values in the data set) x 100

  • Ordered scores: 1, 2, 4, 5, 6, 7, 8, 9, 10, 15, 16

    • 10 is the 9th score

    • 9/11 x 100 = 81.82 = 82nd percentile
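The same calculation as a short function, counting the scores at or below the one in question. The name `percentile_rank` is illustrative.

```python
def percentile_rank(scores, x):
    """Percentage of scores that are at or below x."""
    at_or_below = sum(1 for s in scores if s <= x)
    return at_or_below / len(scores) * 100

marks = [2, 9, 6, 5, 16, 15, 10, 8, 7, 4, 1]   # exam marks out of 20
print(round(percentile_rank(marks, 10)))       # percentile of a raw score of 10
```

Counting with `<=` gives the ordinal rank without having to sort the scores first.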

Percentiles for a Normal Distribution

  • Formula for Percentile from Grouped Data

    • x = the score you are converting to percentile

    • LL = lower limit of the class interval that contains the score

    • i = the size of each class interval

    • fw = the frequency of scores in the interval that contains the score

    • \sum fb = sum of the frequencies (number of scores) below the interval

    • N = number of scores in the dataset
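The formula itself is not shown in these notes. A sketch, assuming it is the common textbook form built from the symbols listed above, percentile = (((x − LL) / i) × fw + Σfb) / N × 100. The table values in the example are hypothetical.

```python
def grouped_percentile(x, LL, i, fw, sum_fb, N):
    """Percentile of score x from a grouped frequency table.

    Assumed formula: (((x - LL) / i) * fw + sum_fb) / N * 100
    """
    within = (x - LL) / i * fw   # estimated number of scores below x inside its own interval
    return (within + sum_fb) / N * 100

# Hypothetical values: interval 10-15 holds 4 scores, 6 scores lie below it, 20 scores in total
result = grouped_percentile(x=12.5, LL=10, i=5, fw=4, sum_fb=6, N=20)
print(result)
```

The formula assumes scores are spread evenly within the interval, so a score halfway through the interval sits above half of that interval's scores.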

Percentiles

  • Also possible to calculate in reverse

  • calculate what raw score a certain percentile represents

  • For example, if we want to cut or drop the bottom 30% of applicants

  • We need to calculate the cut-off score at the 30th percentile

  • Then, cut the applicants at or below this score

Finding the Raw Score of a Given Percentile

  • Bottom 30% of applicants?

  • Same symbols as before with one addition: P = percentile as a decimal

  • PN = (0.30)(16) = 4.8 = 5 (ALWAYS ROUND)

  • 5 scores from the bottom of the table = 0.28 to 0.31 interval

Z-Scores

  • How to calculate a Z-Score

  • Three uses of Z-Scores

Z Scores

  • The connection between percentiles and Z-Scores:

    1. Both are transformations from a raw score to another scale

    2. Both can be used to compare an individual with respect to a group

  • But Z-Scores differ because:

    1. Z-Scores use an interval scale (equal intervals)

    2. Based on SD (standard deviation) as a metric

Three uses of Z-Scores

  1. Compare an individual relative to the group OR calculate % of people falling within a certain score range

    • This is the same as percentiles, but the scale has equal intervals

  2. Compare score units which are different. e.g. height and weight

  3. Calculate probabilities under the normal curve – used for confidence intervals and statistics tests; i.e. Z-Scores are used as a basis for other statistics

Z-Score Formula

  • Z = (x - \bar{X})/SD, where:

    • x = raw score

    • \bar{X} = mean

    • SD = Standard Deviation

Z-Scores

  • Given a sample mean (\bar{X}) of 5 ± a SD of 1, Determine the Z-Score for each example (raw score):

    1. x = 6: Z = (6-5)/1 = +1 or 1

    2. x = 2: Z = (2-5)/1 = -3

    3. x = 5: Z = (5-5)/1 = 0

  • \bar{X} = 20; SD = 5 cm
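The worked examples above can be checked with a one-line function; `z_score` is an illustrative name.

```python
def z_score(x, mean, sd):
    """Distance of a raw score from the mean, expressed in SD units."""
    return (x - mean) / sd

# Mean of 5, SD of 1, as in the examples above
for x in (6, 2, 5):
    print(x, z_score(x, mean=5, sd=1))
```

A positive Z means the score is above the mean, a negative Z means it is below, and 0 means it equals the mean.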

Z-Scores

  • The transformation from a raw score to Z-Score is achieved by changing raw score units to SD units

  • This works because of the normal distribution curve

  • Advantages

    • Uniform Scale

  • Disadvantages

    • Meaning not immediately clear

Using Z-Scores to Calculate %

  • Area under normal curve can be described

  • % under portions of the curve is described

  • Z-Score of 1 = 1 SD above the mean; the area between the mean and Z = 1 is 0.3413 or 34.13% (34%) of the curve

  • Therefore, use Z-Scores to calculate percentiles or % of cases falling in certain score ranges.

Z-Score Concept

  • The transformation from a raw score to Z-Score is obtained by changing raw score units into SD units

  • This works because the assumption here is that we have a normal distribution, so irrespective of score units, certain proportions of scores fall within certain SD units from the mean of any distribution

  • This distribution is called the normal curve

  • Advantages:

    • Uniform scale, such as SD units

  • Disadvantages:

    • Z-Score meaning not immediately obvious

    • Example: \bar{X} = 20, SD = 5

      1. Find percentile for score 25: Z = (x - \bar{X})/SD = (25 - 20)/5 = 1.00
        Z of +1.00 = 34.13%
        50 + 34.13 = 84.13, i.e. 84th percentile

      2. Find the percentile for portion A: x = 15
        Z = (15 - 20)/5 = -1
        Z of -1 = 34.13%
        50 - 34.13 = 15.87, i.e. 16th percentile

  1. Mean = 1.70, SD = 0.10 seconds
    a) What is the Z-Score for 1.75 seconds?
    Z = (1.75 - 1.70)/0.10 = 0.5
    area = 0.1915 or 19.15% (use table)
    b) What is the percentile for 1.75 seconds?
    percentile = 50 + 19.15 = 69.15 or 69th

  2. Mean = 1.70, SD = 0.10 seconds
    a) What is the Z-Score for 1.62 seconds?
    Z = (1.62 - 1.70)/0.10 = -0.8
    area = 0.2881 or 28.81% (use table)
    b) What is the percentile for 1.62 seconds?
    percentile = 50 - 28.81 = 21.19 or 21st
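Instead of looking up areas in a Z table, the percentile can be computed from the normal curve directly. A sketch using `statistics.NormalDist` from the Python standard library, assuming the scores are normally distributed; `normal_percentile` is an illustrative name.

```python
from statistics import NormalDist

def normal_percentile(x, mean, sd):
    """Percent of a normal distribution that falls at or below x."""
    return NormalDist(mu=mean, sigma=sd).cdf(x) * 100

print(round(normal_percentile(25, 20, 5), 2))        # Z = +1
print(round(normal_percentile(15, 20, 5), 2))        # Z = -1
print(round(normal_percentile(1.75, 1.70, 0.10), 2)) # Z = +0.5
print(round(normal_percentile(1.62, 1.70, 0.10), 2)) # Z = -0.8
```

The cumulative area already includes the lower 50% of the curve, so there is no separate "50 plus" or "50 minus" step.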

How to calculate the number of people rather than Percentile?

  • Need to know class size, mean, and SD

  • Example:
    • n = 100, mean = 1.70, SD = 0.10 seconds
      Percentile for 1.62 = 21.19 or 21st
    • What if n = 250? How many people score ≤ 1.62?
      n = 250, mean = 1.70, SD = 0.10 seconds
      Raw score = 1.62, percentile = 21.19
      (21.19/100) * 250 = 52.975 = 53 people
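A sketch of the same head-count calculation, again assuming a normal distribution. `expected_count` is an illustrative name, and the result is rounded to a whole number of people.

```python
from statistics import NormalDist

def expected_count(x, mean, sd, n):
    """Expected number of people, out of n, who score at or below x."""
    proportion = NormalDist(mu=mean, sigma=sd).cdf(x)   # percentile as a decimal
    return round(proportion * n)

print(expected_count(1.62, mean=1.70, sd=0.10, n=250))
```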

More Examples Using the Z-Score Concept

  • If we have a group of people whose intelligence scores (IQ) have a mean of 100 and a SD of 15 (100 \pm 15)
    1) What % of the group have an IQ:
    a) >100: 50%, since 100 is the mean
    b) >120
    Step 1: Z = (120-100)/15 = 1.33
    Step 2: area for Z of 1.33 = 0.4082
    Step 3: 50-40.82 = 9.18%
    2) If a sample of n = 140, how many will have an IQ > 120?

  • (9.18/100) * 140 = 0.0918 * 140 = 12.852 = 13

Three “Calculations” with Z-Scores
  1. Finding the area > x (red shade)

  2. Finding the area ≤ x (i.e. also percentile) (Blue shade)

  3. Finding the area between two scores (> x1 and ≤ x2)
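The three calculations can be sketched as three small functions, assuming a normal distribution. The function names are illustrative, and the example reuses the IQ figures (mean 100, SD 15).

```python
from statistics import NormalDist

def area_at_or_below(x, mean, sd):
    """Proportion of scores <= x (the percentile as a decimal)."""
    return NormalDist(mean, sd).cdf(x)

def area_above(x, mean, sd):
    """Proportion of scores > x."""
    return 1 - NormalDist(mean, sd).cdf(x)

def area_between(low, high, mean, sd):
    """Proportion of scores > low and <= high."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)

print(round(area_above(100, 100, 15) * 100, 2))        # IQ > 100
print(round(area_above(120, 100, 15) * 100, 2))        # IQ > 120
print(round(area_between(85, 115, 100, 15) * 100, 2))  # within 1 SD of the mean
print(round(area_above(120, 100, 15) * 140))           # people out of 140 with IQ > 120
```

The percentage for IQ > 120 comes out slightly below the table-based 9.18% because the code does not round Z to two decimals before finding the area; the head count out of 140 still rounds to the same whole number.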

T-Scores

  • Way of converting Z scores to an understandable form (i.e. “reshape” the distribution).

T-Scores Examples

  • a) Z = -0.36; what is