kine-2050-notes
Topic 1 - Introduction to Statistics
Statistics: A set of procedures and principles for collecting, organising, analysing, and interpreting data to help people make decisions under uncertainty.
H.G. Wells (1903): Hypothesised that statistical thinking would become as crucial as literacy for good citizenship.
Importance of Studying Statistics:
Necessary for evaluating scientific evidence in any scientific discipline.
Enables responsible evaluation of statistical arguments used to influence behaviour.
What is Statistics?
The science of collection, organisation, analysis, and interpretation of data.
Big Picture Goal: To infer conclusions about a population using data from a sample.
Purpose of Data: To get necessary information and knowledge.
Case Study: How Nimble Are Your Fingers?
Manual Dexterity Test: Measures how many small pieces one can assemble in one minute.
Summaries: Mean and other measures of central tendency, variance measures, min/max scores, and the number of participants.
Moral: Simple data summaries can tell interesting stories and make data easier to understand. Data is used to judge and make decisions.
The Discovery of Knowledge
Asking the right question(s).
Collecting useful data including deciding how much is needed.
Summarising and analysing data to answer the question(s).
Making decisions and generalisations based on the observed data.
Turning the data and subsequent decisions into new knowledge.
Two Types of Statistics
Descriptive Statistics
Organise, describe, and summarise a small dataset.
Results represent the entire dataset.
The sampled group is the group of interest.
Can be the first stage of analysis.
Often used when researchers begin a new area of investigation.
Inferential Statistics
Conclusions about populations are derived from small, random samples.
The sampled group is a sample of the group of interest.
Use smaller datasets to make estimates and draw conclusions about a larger population from which the sample is drawn.
Can determine cause-and-effect relationships, test hypotheses, and make predictions.
Population and Sample
Population: All objects that researchers want to describe or make inferences about.
Characteristic of a population = “parameter” (population size is denoted N)
Sample: A sub-group of a population that the researcher believes represents the population.
A group of a specific size (n) is selected and measured.
Characteristic of a sample = “statistic”
Random Samples: Best for experiments because they are unbiased.
Every element of the population must be equally likely to be selected for the sample group.
The selection of one element does not affect the possibility of other elements being selected.
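A minimal Python sketch of drawing a simple random sample, assuming a hypothetical population of 1,000 ID numbers and a sample size of n = 50 (both invented for illustration). It also contrasts a parameter (population mean) with a statistic (sample mean).

```python
import random

# Hypothetical population: ID numbers 1..1000 (an assumption for illustration).
population = list(range(1, 1001))
N = len(population)   # population size
n = 50                # sample size

rng = random.Random(2050)            # fixed seed so the draw is repeatable
sample = rng.sample(population, n)   # every element is equally likely to be chosen

# A statistic (sample mean) is used to estimate a parameter (population mean).
population_mean = sum(population) / N
sample_mean = sum(sample) / n
print(f"parameter (population mean) = {population_mean:.1f}")
print(f"statistic (sample mean)     = {sample_mean:.1f}")
```

Note that random.sample draws without replacement, so no element appears twice in the sample.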
Variables
Characteristics of a person, object, or phenomenon that is amenable to change and measurable.
Any observable/measurable property of organisms, objects, or events.
Types of Variables
Quantitative (Numerical)
Qualitative (Categorical)
Quantitative Variables
Numerical data that can be added, subtracted, multiplied, and divided.
Examples:
Age (years), Blood pressure (mm of Hg)
BMI, Pulse per minute
Exercise in hours per week, Coffee drinking in ounces per day
Continuous vs. Discrete
Continuous: Can theoretically take on any value within a given range (e.g., height = 188.99955 cm).
Discrete: Can only take on certain values (e.g., number of children in a family, number of cities).
Qualitative Variables
Binary: Two categories
Examples:
Dead/alive
Treatment/placebo
Disease/no disease
Exposed/Unexposed
Heads/Tails
Did you have breakfast in the morning? (Yes/No)
More than Two categories
Example:
Hair colour – Blonde, Red-haired, Brown, and so forth.
Classification of Data or Levels of Measurement
Four levels of measurement:
Qualitative
Nominal
Ordinal
Quantitative
Interval
Ratio
A) Nominal
Data placed in categories (no ordering).
Cannot be quantified.
Mutually exclusive.
e.g., blood type, type of car owned, gender, colour of paint.
B) Ordinal
Data is ranked.
e.g., “Idol” contest, preference (first, second, third), mineral hardness, cancer stages, University ranking, Letter-grades.
C) Interval
Equal units of measurement assigned to the attribute.
Zero point is arbitrary!
Therefore, not proportional (or multiplicable).
e.g., temperature (°F, °C): temperature can be below 0 degrees Celsius (-10 or -20), so zero does not mean “no temperature”.
D) Ratio
Same as interval but zero is absolute or true.
Zero indicates an absence of the variable.
Therefore, direct comparison can be made.
e.g., Age, distance, weight, time, money, etc.
Dependent and Independent Variables
Dependent Variable (DV)
The variable of primary interest; i.e., it is measured.
A variable whose changes we wish to study; a response variable.
The variable designed to measure the effect of the variation of the independent variable.
Independent Variable (IV)
A variable we believe affects the measurements obtained on the dependent variable; i.e., it is manipulated.
A variable whose effects on the dependent variable we wish to study.
The variable that the researcher changes within a defined range, to study the effect on the dependent variable.
Variables: Vitamin C study
The independent variable, daily vitamin C intake, is hypothesised to affect the dependent variable, life span.
Scientists will manipulate the vitamin C intake in a group of 100 people: 50 people will be given a daily high dose of vitamin C and 50 people will be given a placebo pill over a period of 25 years. The goal is to see if the independent variable of high vitamin C dosage affects the people's life span.
Experimental Control
“Ideal”: To imply causation, the experimenter eliminates the influence of all variables that could affect the DV except the one(s) directly manipulated.
All conditions are kept the same for all participants except for the effect of the IV.
“Reality”: Impossible to control all variables that could affect the DV.
Researchers control the variables they can.
Other influences that are not controlled are assumed to be randomized. (i.e., we assume the effects are “washed out” if they are “spread out” over the groups).
Statistical Methods
The researcher's tools:
To assist in describing data
In making inferences or generalisations from experimental data (sample) to larger groups (population)
In studying causal relationships
Topic 2: Organising and Displaying Data
Good research is based on collecting large amounts of data, which needs to be simplified.
Frequency distribution: Lists all possible data values or types and the frequency of occurrence of each one.
Meant to organise and describe the data in table form.
Use the frequency table to construct a frequency histogram (graph).
Reveal the pattern of the scores/observations.
Types of Frequency Distributions
a) Ungrouped: does not need grouping (e.g., blood type)
Frequency of all the possible data values or items in your dataset.
Can be nominal/ordinal categories OR quantitative but small numbers of single values.
Already grouped
Blood type
b) Grouped (class intervals): bundled into chunks
Applies when all “possible data values” would be too many, so data is arranged and separated into groups called class intervals.
Each class interval includes a range of data
Bundle into chunks
Steps in Constructing a Frequency Distribution
Seven steps in constructing a Grouped frequency distribution.
Step 1: Count the number of scores
Step 2: Identify highest and lowest score (range)
Step 3: Identify the smallest unit of measurement
What is the smallest possible division?
(i.e., by how much can your score increase?)
Step 4: Decide on the appropriate number of class intervals
Step 5: Decide on the score range of each class interval
Step 6: Round this class interval to make the range PRETTY. An i (class interval) of 16.57 or 14.5 would be an ugly, clunky range!
Instead, 20, 15, 12 or 10 would be prettier. So try a couple of them, starting with a class interval of 15, and maybe go smaller afterward.
Step 7: List class intervals of scores in order. Make sure that intervals have:
Same width (range of numbers)
no overlap across intervals
no gaps
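The steps above can be sketched in Python. This is a minimal sketch, assuming 20 hypothetical whole-number scores and a "pretty" class interval of 10; the function name grouped_frequency is made up for illustration. Intervals are treated as lower ≤ score < upper, which is one way to guarantee equal width, no overlap, and no gaps.

```python
import math

def grouped_frequency(scores, width, start=None):
    """Tally scores into class intervals [lower, lower + width)."""
    lowest, highest = min(scores), max(scores)        # Step 2: lowest and highest score
    if start is None:
        # Round the lowest score down to a "pretty" lower limit.
        start = math.floor(lowest / width) * width
    n_bins = math.floor((highest - start) / width) + 1
    counts = [0] * n_bins
    for s in scores:
        counts[int((s - start) // width)] += 1        # every score lands in exactly one bin
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(n_bins)]

# Hypothetical scores (invented for illustration).
scores = [23, 45, 67, 51, 38, 72, 98, 55, 41, 60,
          34, 59, 66, 48, 77, 82, 50, 63, 29, 57]

table = grouped_frequency(scores, width=10)
for lower, upper, f in table:
    print(f"{lower:>3} to <{upper:<3}  f = {f}")
```

Because all scores are tallied, the frequencies add up to the number of scores counted in Step 1.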
Ungrouped distributions
Use UNGROUPED (data that are already in categories, e.g., nominal data such as blood type, with no order or connection):
when data are items rather than numbers, i.e., nominal or ordinal (qualitative) values
when all possible data values can be listed without being too many (< 15), e.g., a small number of possible discrete scores (such as how many courses each student in this class is enrolled in this term)
In this case, the number of rows in the frequency table is clear, so the steps are not needed; just list the items or values and start tallying
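A minimal Python sketch of tallying an ungrouped frequency distribution, assuming hypothetical blood types for 12 people (invented for illustration):

```python
from collections import Counter

# Hypothetical blood types for 12 people (invented for illustration).
blood_types = ["O", "A", "A", "B", "O", "AB", "O", "A", "O", "B", "A", "O"]

# One row per category; the count is the frequency of that category.
frequency = Counter(blood_types)
for category, count in frequency.most_common():
    print(f"{category:>2}  f = {count}")
```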
Grouped distributions
Use GROUPED data when data values are continuous (e.g., weight, time, blood pressure) or there are too many possible data values (e.g., age or salary)
Thus, a range of values is needed, known as the class interval (i) or bin; calculated in step 5
A good start is to first estimate what would be a good number of bins (step 4), but you may need to redo steps 4 and 5 to get a “PRETTY” class interval or bin, e.g., 2, 5, 10, 12, 15, 20, etc. (or 0.01, 0.2, 0.5, etc.)
To correctly GROUP data
After calculating the class interval (range), add this i to the start of each bin, starting with the smallest score or value (or round down to have a pretty lower number) in the dataset
These bins should have (1) the same width/range (2) no overlap across bins (3) no gaps (unit change grams e.g.) (4) cover all the data in that set.
NOTE: Sometimes you will want to exclude extreme scores from your data while doing steps 2-6 to determine class intervals. But after determining the class intervals, you need to add these extreme scores back to the tally. All scores must be included in your distribution.
Graphs
Histograms only represent frequencies (grouped data)
A pictorial representation of a frequency distribution or other data
Helpful in understanding concepts e.g., frequencies, and other summary data (next topics)
Two types of graphs used to plot frequency
Bar graphs for Ungrouped data
Histograms for Grouped data
Bar Graphs
Depict frequencies or other group statistics (e.g., those described in the next topic) as a vertical bar for each group or category
The groups or categories on the x-axis are nominal … so the order of the categories doesn’t matter
Separated by some space.
Bar graphs with no space between the vertical bars are histograms, which are used for interval/ratio data
Complex bar graphs (more info)
Histogram
A histogram uses vertical bars to depict the frequencies of an interval/ratio variable (but not other statistics).
A histogram differs from a bar graph in that it does not have spaces between the bars
Ungrouped FD is depicted by a bar graph while Grouped FD is depicted by a histogram
TOPIC 3: Measure of Central Tendency
A. Mean
B. Median (mdn)
C. Mode
A. Mean (Arithmetic average)
Calculate by hand from raw scores: Mean = \Sigma x / n (the sum of the scores divided by the number of scores)
Sensitive to extremes
e.g. 1.42, 1.97, 1.42, 1.50, 1.67: \Sigma x = 7.98, Mean = 7.98/5 = 1.60
OR can calculate by computer (or statistics programs like Excel)
The mean of a sample of X scores is symbolised as \bar{X} (X-bar) ← statistic
The mean of a population is symbolised by the Greek letter \mu (mu) ← parameter
B. Median (middle score)
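The median value indicates the midpoint of the given observations. It divides the rank-ordered values into 2 equal parts. To locate it, count how many values you have, add 1, and divide by 2; this gives the position of the median in the ordered list (with an even number of values, average the two middle values).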
C. Mode
The value(s) that occur most often (the most common value in the dataset)
e.g. 1, 7, 5, 9, 8, 7: Mode = 7
e.g. 1, 7, 5, 9, 8: No mode
e.g. 1, 7, 7, 5, 9, 9, 8: Bimodal, 7 and 9
However, mode is ONLY useful for categorical data, or possibly very large datasets of interval/ratio data.
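A minimal Python sketch of the three measures, using the example values from these notes. The helper name modes is made up for illustration; it returns an empty list when no value repeats, to match the "No mode" case.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Return the most frequent value(s); an empty list if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []   # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

scores = [1.42, 1.97, 1.42, 1.50, 1.67]
print(round(mean(scores), 2))      # sum of the scores divided by the number of scores
print(median(scores))              # middle value of the rank-ordered scores

print(modes([1, 7, 5, 9, 8, 7]))       # one mode
print(modes([1, 7, 5, 9, 8]))          # no mode
print(modes([1, 7, 7, 5, 9, 9, 8]))    # bimodal
```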
Which measure to use?
Nominal – has its own categories; no need to group. USE MODE (can’t calculate mean or median): e.g. Law = 64; Kine = 59; Eng. = 37
Ordinal – there is an order to the categories. USE MEDIAN e.g. 1st, 2nd, 3rd, 4th, 5th
Interval or Ratio – USE Mean and/or Median
Use MEDIAN instead of mean if highly skewed distribution or if have outliers (median not as affected).
Excel: Calculating central tendency measures
Formula in Excel
Mean = average(data_set)
Median = median(data_set)
Mode = mode(data_set)
Or, given the Toolpak, run Descriptive statistics under Data Analysis
Graphs for summary data
The distribution of data, including central tendencies like mean and medians, can also be graphically depicted
Bar Graphs & Histograms
Bar graphs and histograms can be used to plot means (or medians) of the dependent variable (DV) along the y-axis, across the groups or values on the x-axis:
Bar graphs: x-axis represents different groups or categories (nominal); and have space between bars
Histograms: x-axis plots the independent variable that are interval/ratio
Key characteristics of Graphs
The distribution of your sample data (for interval and ratio units of measure)
Peaks
Spread
Kurtosis (leptokurtic, platykurtic)
Symmetry/Non Symmetrical distributions (POSITIVE & NEGATIVE)
1) PEAKS
The tallest cluster/s of bars:
Represent the most common values/bulk of data: the mode
2) Spread
How much does the data vary?
3) Kurtosis
The relative peaked-ness or flatness of the distribution.
It reflects whether the scores are more or less evenly distributed throughout the measurement range.
Leptokurtic
The scores are bunched together with steeply sloping sides.
Platykurtic
The scores are more evenly spread out:
A greater proportion of the scores fall toward the ends, or tails.
4) Symmetry/Normal distribution
A distribution is termed Symmetrical when the data frequencies decrease at equal rates above and below a central point.
Visually Bisected (One half is mirror image of the other): Mean = Median = Mode
5) Non-Symmetrical distributions: Skewed (positive or negative)
bunching of the observations at one or the other end of the measurement range
Skewness of data
To check whether your data are skewed, calculate: skewness = 3 \times (mean - median)/SD
If the result (ignoring the sign) is below 1, the data are not skewed; between 1 and 3, skewed; above 3, extremely skewed
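A minimal Python sketch of a skewness check. It assumes the formula is Pearson's second skewness coefficient, 3 × (mean − median) / SD, which is positive when the longer tail is on the right, and it assumes the sample SD is intended. The data sets are hypothetical.

```python
from statistics import mean, median, stdev

def pearson_skew(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]                 # hypothetical, mean equals median
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]      # hypothetical, one extreme high score

print(pearson_skew(symmetric))    # zero: mean and median coincide
print(pearson_skew(right_tail))   # positive: the mean is pulled toward the high score
```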
Bimodal data
Why it’s not sensible to rely only on summary statistics like central tendency. SHOULD ALWAYS PLOT!
Distributions with outliers
Outliers, like skewed distributions, pull the mean but have little effect on the median. The mean will “lean” toward the extreme: the outliers or the longer tail.
Examples of distribution
Examples of Normal distribution e.g. height, birth weight, errors in measurements or distance from bull’s eye, blood pressure, RT, MT, marks on a test (if not too easy or hard), IQ, GPA, shoe size, hours slept
Examples of Skewed data
Some variables are more likely to be positively skewed: income, scores on a difficult test, numbers of kids or cars or broken bones etc, points scored in a game, variables with a lower limit like weight.
Some variables are more likely to be negatively skewed: age at death/lifespan, scores on an easy test, income as a function of age, variables with an upper limit (100%).
A few variables are bimodal – usually reflect a combination of two distributions; peak restaurant hours, book prices, or height when include men and women. Bimodal means two peaks/mode but peaks/mode don’t need to be equal.
TOPIC 4: MEASURE OF VARIABILITY
Measure of Variability
A measure of variability is a single number that describes the spread in a set of data.
How much do the scores vary from one another?
Most common measures:
Range
Standard deviation (SD) = \sqrt{variance}
1. Range
Total spread in data
Range = highest score – lowest score
E.g. 4, 5, 7, 9 Range = 9 – 4 = 5
Interpretation – score range over which 100% of scores fall.
Advantage – very quick. – can be used for all levels of measurement
Disadvantage – influenced by single extreme scores
Note: \Sigma X means add up all the X values
Average Deviation
How much does each score deviate (vary) from the mean?
E.g. 14, 12, 9, 17, 8 N=5 Mean = 60 / 5 = 12
How much each score is from the mean?
14 – 12 = 2
12 – 12 = 0
9 – 12 = – 3
17 – 12 = 5
8 – 12 = – 4
Note 1: Add up the deviations = 0. ALWAYS!
Note 2: Use absolute values (i.e. ignore the negative sign): 2 + 0 + 3 + 5 + 4 = 14. Now divide by N, e.g. 14 / 5 = 2.8
Simple but never used.
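The average deviation example can be checked with a few lines of Python. This is only a sketch of the arithmetic in these notes, using the same five scores.

```python
scores = [14, 12, 9, 17, 8]
n = len(scores)

mean = sum(scores) / n                       # 60 / 5 = 12
deviations = [x - mean for x in scores]      # 2, 0, -3, 5, -4
sum_of_deviations = sum(deviations)          # deviations from the mean cancel out

# Ignore the signs, add up, and divide by the number of scores.
average_deviation = sum(abs(d) for d in deviations) / n   # 14 / 5 = 2.8

print(mean, sum_of_deviations, average_deviation)
```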
Is there another method to remove the negative sign?
Variance (\sigma^2 Greek letter sigma squared )
Sum of Squares
The sum of the squared deviations from the mean, \sum (X - \mu)^2
Always a positive value
1. Variance
Divide the Sum of Squares by N to get the variance: \sigma^2 = \sum (X - \mu)^2 / N
2. Standard Deviation
Measure of variability for scores about the mean.
Measure of the deviations of all the scores from the mean, expressed as a single number. Note that the equation for a sample differs slightly from the population equation given earlier, which ILLUSTRATES better what SD is; the sample version is the one you need to know for this course.
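A minimal Python sketch of sum of squares, variance, and SD, reusing the scores 14, 12, 9, 17, 8 from the average deviation example. The sample equation itself is not reproduced in these notes, so the n − 1 denominator below is an assumption based on the usual sample formula.

```python
import math

scores = [14, 12, 9, 17, 8]
n = len(scores)
mean = sum(scores) / n

# Sum of Squares: squared deviations from the mean, added up (4 + 0 + 9 + 25 + 16).
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population version: divide the Sum of Squares by the number of scores.
population_variance = sum_of_squares / n
population_sd = math.sqrt(population_variance)

# Sample version (assumed): divide by n - 1 instead.
sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(sum_of_squares, population_variance, sample_variance)
print(round(population_sd, 2), round(sample_sd, 2))
```

The sample SD comes out slightly larger than the population SD because the same Sum of Squares is divided by a smaller number.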
Interpretation
SD – a method of specifying the % of scores falling within certain score limits around the mean.
e.g. What does Mean = 14 ± 2 imply? 68% of scores fall within the score range of 12–16
e.g. What if Mean = 14 ± 1? 68% of scores between 13–15
e.g. What if Mean = 14 ± 3? 68% of scores between 11–17
Quick but crude estimate: SD = range / 4
Mean vs. Median
With extreme scores, the mean is not a good measure of central tendency. How much skewing before using median instead of mean?
Rule of Thumb: if the mean and median differ by 1 SD or more, then use the median
TOPIC 5: PERCENTILES AND Z-SCORES
Standard Normal Distribution
Interquartile Range (IQR)
IQR is a measure of variability, based on dividing a dataset into quartiles
Quartiles divide a rank-ordered data set into four equal parts
The values that divide each part are called the first, second, and third quartiles; and
they are denoted by Q1, Q2,and Q3, respectively
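A minimal Python sketch of quartiles and the IQR, assuming the usual definition IQR = Q3 − Q1 (not spelled out in these notes) and a hypothetical rank-ordered data set. Quartile conventions differ between textbooks and software; this sketch uses the "inclusive" method of Python's statistics.quantiles.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical, already rank-ordered

# n=4 asks for the three cut points that split the data into four equal parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```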
Percentiles
Calculate the percentile of a raw score from ungrouped data
Calculate the percentile of a raw score from grouped data
Calculate the raw score of a given percentile
Relative Scores: Percentiles and Z-Scores
A method of describing the standing of an individual in relation to a group.
Common use with norms e.g., height/weight tables, age, fitness level, exams.
Achieved by translating an individual’s raw score into either percentile or z-score (transformation)
Raw scores refer to the original measure e.g., height, % on exam
transform to letter grade.
Percentiles
Percentile score = the PERCENTAGE of people in the group who have the same raw score or a lower raw score, than the one in question.
Note: 50th percentile = median
e.g. Exam marks out of 20. Your score = 10
Express the raw score of 10 as a percentile
What % of total scores (n = 11) fall at or below 10? Scores: 2, 9, 6, 5, 16, 15, 10, 8, 7, 4, 1
Order the scores: 1, 2, 4, 5, 6, 7, 8, 9, 10, 15, 16
Locate the score: 10 is the 9th score
Calculate the percentile = (ordinal rank of the given value / number of values in the dataset) x 100
9/11 x 100 = 81.82 = 82nd percentile
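The exam example can be reproduced with a short Python sketch. The function name percentile_rank is made up for illustration; it counts the scores at or below the given raw score.

```python
def percentile_rank(scores, x):
    """Percentage of scores that are at or below the raw score x."""
    at_or_below = sum(1 for s in scores if s <= x)
    return at_or_below / len(scores) * 100

marks = [2, 9, 6, 5, 16, 15, 10, 8, 7, 4, 1]   # exam marks out of 20, n = 11
pr = percentile_rank(marks, 10)                # 9 of the 11 scores are at or below 10

print(round(pr, 2), round(pr))
```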
Percentiles for a Normal Distribution
Formula for Percentile from Grouped Data
x = the score you are converting to percentile
LL = lower limit of the class interval that contains the score
i = the size of each class interval
fw = the frequency of scores in the interval that contains the score
\sum fb = sum of the frequencies of all intervals below the interval that contains the score
N = number of scores in the dataset
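A minimal Python sketch of a grouped-data percentile. The formula itself does not appear in these notes; the version below is the common linear-interpolation form built from the symbols listed above, so treat it as an assumption: percentile = [((x − LL) / i) × fw + Σfb] / N × 100. The example numbers are hypothetical.

```python
def grouped_percentile(x, LL, i, fw, sum_fb, N):
    """Percentile of raw score x from a grouped frequency distribution.

    Assumes scores are spread evenly within the interval that contains x.
    """
    within_interval = ((x - LL) / i) * fw     # share of the interval's scores below x
    return 100 * (within_interval + sum_fb) / N

# Hypothetical example: x = 75 lies in the interval starting at 70 (width 10),
# which holds 8 scores; 20 scores fall in lower intervals; 40 scores in total.
p = grouped_percentile(x=75, LL=70, i=10, fw=8, sum_fb=20, N=40)
print(p)
```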
Percentiles
Also possible to calculate in reverse
calculate what raw score a certain percentile represents
For example, if we want to cut or drop the bottom 30% of applicants
We need to calculate the cut-off score at the 30th percentile
Then, cut the applicants at or below this score
Finding the Raw Score of a Given Percentile
Bottom 30% of applicants?
Same symbol as before with one addition, P = percentile as a decimal
PN = (0.30 \times 16) = 4.8 ≈ 5 (ALWAYS ROUND)
5 scores from the bottom of the table = 0.28 to 0.31 interval
Z-Scores
How to calculate a Z-Score
Three uses of Z-Scores
Z Scores
The connection between percentiles and Z-Scores:
Both are transformations from a raw score to another scale
Both can be used to compare an individual with respect to a group
But Z-Scores differ because:
Z-Scores use an interval scale (equal intervals)
Based on SD (standard deviation) as a metric
Three uses of Z-Scores
Compare an individual relative to the group OR calculate % of people falling within a certain score range
This is the same as percentiles, but the scale has equal intervals
Compare score units which are different. e.g. height and weight
Calculate probabilities under the normal curve – used for confidence intervals and statistics tests; i.e. Z-Scores are used as a basis for other statistics
Z-Score Formula: Z = (x - \bar{X})/SD
Where:
x = raw score
\bar{X} = mean
SD = Standard Deviation
Z-Scores
Given a sample mean (\bar{X}) of 5 and an SD of 1 (5 ± 1), determine the Z-Score for each raw score:
x = 6: Z = (6-5)/1 = +1 or 1
x = 2: Z = (2-5)/1 = -3
x = 5: Z = (5-5)/1 = 0
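The same three examples as a short Python sketch, using Z = (raw score − mean) / SD. The function name z_score is made up for illustration.

```python
def z_score(x, mean, sd):
    """Convert a raw score to SD units above (+) or below (-) the mean."""
    return (x - mean) / sd

for raw in (6, 2, 5):
    print(raw, z_score(raw, mean=5, sd=1))
```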
\bar{X} = 20; SD = 5 cm
Z-Scores
The transformation from a raw score to Z-Score is achieved by changing raw score units to SD units
This works because of the normal distribution curve
Advantages
Uniform Scale
Disadvantages
Meaning not immediately clear
Using Z-Scores to Calculate %
Area under normal curve can be described
% under portions of the curve is described
Z-Score of 1 = 1 SD above the mean; the area between the mean and Z = 1 is 0.3413 or 34.13% (34%) of the curve
Therefore, use Z-Scores to calculate percentiles or % of cases falling in certain score ranges.
Z-Score Concept
The transformation from a raw score to Z-Score is obtained by changing raw score units into SD units
This works because the assumption here is that we have a normal distribution, so irrespective of score units, certain proportions of scores fall within certain SD units from the mean of any distribution
This distribution is called the normal curve
Advantages:
Uniform scale, such as SD units
Disadvantages:
Z-Score meaning not immediately obvious
Example: \bar{X} = 20, SD = 5
Find percentile for score 25: Z = (x - \bar{X})/SD = (25 - 20)/5 = 1.00
Z of +1.00 = 34.13%
50 + 34.13 = 84.13, i.e. the 84th percentile
Find the percentile for portion A: x = 15
Z = (15 - 20)/5 = -1
Z of -1 = 34.13
50 - 34.13 = 15.87 16th percentile
Mean = 1.70, SD =0.10 seconds
a) What is the Z-Score for 1.75 seconds?
Z = (1.75-1.70)/0.1 = 0.5
area =0.1915 or 19.15% (use table)
b) What is the percentile for 1.75 seconds?
percentile = 50+19.15 = 69.15 or 69th
Mean = 1.70, SD =0.10 seconds
a) What is the Z-Score for 1.62 seconds?
Z = (1.62 - 1.70)/0.10 = -0.8
area =0.2881 or 28.81%
b) What is the Percentile for 1.62 seconds?
50-28.81= 21.19
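The table lookups in these examples can be checked with a short Python sketch. It assumes a normal distribution and uses the standard library's NormalDist in place of the printed Z table; the function name percentile_from_score is made up for illustration.

```python
from statistics import NormalDist

def percentile_from_score(x, mean, sd):
    """Percentile of raw score x, assuming the scores are normally distributed."""
    z = (x - mean) / sd                  # raw score units to SD units
    return NormalDist().cdf(z) * 100     # area at or below z, as a percentage

print(round(percentile_from_score(25, 20, 5), 2))         # Z = +1
print(round(percentile_from_score(15, 20, 5), 2))         # Z = -1
print(round(percentile_from_score(1.75, 1.70, 0.10), 2))  # Z = +0.5
print(round(percentile_from_score(1.62, 1.70, 0.10), 2))  # Z = -0.8
```

The cumulative area already includes the 50% below the mean, so there is no separate "50 + area" or "50 − area" step as with the table.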
How to calculate the number of people rather than Percentile?
Need to know class size, mean, and SD
Example:
n = 100, mean = 1.70, SD = 0.10 seconds
Percentile for 1.62 = 21.19 or 21st
What if n = 250? How many people score ≤ 1.62?
n = 250, mean = 1.70, SD = 0.10 seconds
Raw score = 1.62, Z-Score = -0.8, Percentile = 21.19
(21.19/100) * 250
= 52.975 = 53 people
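A minimal Python sketch of turning a percentile into a number of people, assuming a normal distribution. The function name expected_count_at_or_below is made up for illustration. It uses the exact normal curve rather than the rounded table value, so the unrounded result differs slightly from 52.975 but still rounds to 53.

```python
from statistics import NormalDist

def expected_count_at_or_below(x, mean, sd, n):
    """Expected number of people (out of n) scoring at or below x."""
    proportion = NormalDist(mu=mean, sigma=sd).cdf(x)   # percentile as a proportion
    return proportion * n

count = expected_count_at_or_below(1.62, mean=1.70, sd=0.10, n=250)
print(round(count))
```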
More Examples Using the Z-Score Concept
If we have a group of people whose intelligence scores (IQ) have a mean of 100 and an SD of 15 (100 \pm 15)
1) What % of group have IQ:
a) >100: 50% (in a normal distribution, half the scores lie above the mean)
b) >120
Step 1: Z = (120-100)/15 = 1.33
Step 2: area for Z = 1.33 is 0.4082 (use table)
Step 3: 50 - 40.82 = 9.18%
2) If a sample has n = 140, how many will have an IQ > 120? (9.18/100) * 140 = 0.0918 * 140 = 12.852 = 13
Three “Calculations” with Z-Scores
Finding the area > x (red shade)
Finding the area ≤ x (i.e. also percentile) (Blue shade)
Finding the area between two scores (x1 < X ≤ x2)
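A minimal Python sketch of the three calculations, assuming normally distributed IQ scores with mean 100 and SD 15, and reading the third calculation as the area between two scores. The exact curve gives about 9.1% above 120; the table-based 9.18% differs slightly because Z is rounded to 1.33 before the lookup.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

above_120 = 1 - iq.cdf(120)                        # area above a score
at_or_below_120 = iq.cdf(120)                      # area at or below a score (the percentile)
between_100_and_120 = iq.cdf(120) - iq.cdf(100)    # area between two scores

print(round(above_120 * 100, 1))
print(round(at_or_below_120 * 100, 1))
print(round(between_100_and_120 * 100, 1))
```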
T-Scores
Way of converting Z scores to an understandable form (i.e. “reshape” the distribution).
T-Scores Examples
a) Z = -0.36; what is