Statistics Exam Notes
Introduction to Statistics
Overview:
Introduction to Statistics
Population and Sample
Data and Variables
Levels of Measurement
Introduction to the Analysis ToolPak Add-in for Excel
Reading: Horvath (p 1-15)
Statistics: What’s the Point?
H. G. Wells (1903) hypothesized: “Statistical thinking would one day be as necessary for good citizenship as the ability to read and write”
Consequences of mathematical innumeracy aren't as obvious as those of illiteracy.
Lack of numerical perspective.
Misunderstanding of probability.
Why Study Statistics?
Understanding scientific evidence requires knowledge of statistical procedures.
Statistics helps in evaluating arguments responsibly against those trying to influence behavior with statistical arguments.
Science News Headline: Diet rich in animal protein is associated with a greater risk of early death
Journal Reference: Virtanen, et al. Dietary proteins and protein sources and risk of death: the Kuopio Ischaemic Heart Disease Risk Factor Study. The American Journal of Clinical Nutrition, 2019. DOI: 10.1093/ajcn/nqz025
Results:
Average follow-up of 22.3 years showed 1225 deaths due to disease.
Higher total and animal protein intakes had borderline statistically significant associations with increased mortality risk.
Multivariable-adjusted HR (95% CI) in the highest compared with the lowest quartile:
Total protein intake: 1.17 (0.99, 1.39; P across quartiles = 0.07).
Animal protein intake: 1.13 (0.95, 1.35; P = 0.04).
Higher animal-to-plant protein ratio (extreme-quartile HR = 1.23; 95% CI: 1.02, 1.49; P-trend = 0.01) and higher meat intake (extreme-quartile HR = 1.23; 95% CI: 1.04, 1.47; P = 0.01) were associated with increased mortality.
Association of total protein with mortality was more evident among those with a history of type 2 diabetes, cardiovascular disease, or cancer.
Intakes of fish, eggs, dairy, or plant protein sources were not associated with mortality.
What is Statistics?
The science of the collection, organization, analysis, and interpretation of data.
A set of procedures and principles for collecting and organizing data and analyzing information to help people make decisions when faced with uncertainty.
Big Picture Goal:
To take data from a sample and make conclusions about the population.
Purpose of Data:
To get necessary information and knowledge.
Data + Interpretation = Information; Information + Analysis, Discussion, Inferences = Knowledge
"Data" is not "information" unless it is "interpreted"
Case Study: How Nimble Are Your Fingers?
Manual dexterity test: How many small pieces can you assemble in one minute?
Data: 200 students from a large statistics class.
Question: Which gender (male = 1) has better manual dexterity?
Summarizing the data involves measures of central tendency, measures of variability, minimum and maximum scores, and the number of participants.
Simple summaries of data tell an interesting story and are easier to digest than large quantities of information.
Data are used to make a judgment or decision about a situation. This is what statistics is all about.
The Discovery of Knowledge:
Asking the right question(s).
Collecting useful data, including deciding how much is needed.
Summarizing and analyzing data, with the goal of answering the question(s).
Making decisions and generalizations based on the observed data.
Turning the data and subsequent decisions into new knowledge.
Two Types of Statistics
Descriptive Statistics:
Organize, describe, and summarize a small dataset.
Results obtained represent the entire dataset.
Can constitute the first stage of analysis.
Often used when researchers begin a new area of investigation.
Example: Deaths by Social Class (N=1316).
Class (SES)   Men    Women   Children    Total
1st class     67%    3%      not given   38%
2nd class     92%    14%     not given   59%
3rd class     84%    54%     66%         62%
Total         82%    26%     48%         62%
Inferential Statistics:
Conclusions about populations are derived from small (random) samples.
Uses a smaller dataset to make estimates and draw conclusions about the greater population (that the sample is drawn from).
Can be used to determine cause and effect relationships, test hypotheses, and make predictions.
Population and Sample
Population:
All of the objects that researchers want to describe or make inferences about.
Characteristic of population = “parameter”.
Sample:
Sub-group of population that researcher believes represents the population.
A group of specific size (n=) is selected and measured.
Characteristic of sample = “statistic”.
Needs to be a good estimate of the population's parameters, because the population itself usually cannot be measured (too big).
Take a sample, a smaller group that is easier to measure (e.g. 72 y10 = sample = all ages).
Choosing a Sample from Population:
DO NOT DO:
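Over- or under-represent groups within the population (e.g., a sample made up of 20% homeless, 40% healthy, and 40% jobless people is not acceptable).
This is not a good representation of the population: the sample should reflect the proportions of all groups in the population!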
Experimental designs: manipulation.
Non-experimental designs: observation.
The best samples for experiments are those which are selected randomly!
Random sample means that the sample is unbiased.
To achieve this, every element of the population must be equally likely to be selected for the sample.
Selection of one element does not affect the possibility of other elements being selected.
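A minimal Python sketch of simple random sampling. The population of 1,000 ID numbers, the sample size of 72, and the function name are assumptions made only for illustration. `random.sample` draws without replacement and gives every element the same chance of being chosen.

```python
import random

def draw_simple_random_sample(population, n, seed=None):
    """Return n distinct elements, each equally likely to be selected."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return rng.sample(population, n)

# Hypothetical population: ID numbers 1..1000 (illustrative only).
population_ids = list(range(1, 1001))
sample_ids = draw_simple_random_sample(population_ids, n=72, seed=42)
print(len(sample_ids), "IDs selected; first few:", sorted(sample_ids)[:5])
```

In practice the IDs would index a real sampling frame (e.g., a class list), and the seed would be omitted.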
Structure of Data
Observations (= individuals or cases).
Variables = observations’ attributes.
Data are any recorded observations, and are usually numeric.
Experimental designs = manipulation = effect of the IV on the DV.
Variables
A characteristic of a person, object, or phenomenon that can change and can be measured.
Any observable/measurable property of organisms, objects, or events.
Types of Variables:
Quantitative (Numerical).
Qualitative (Categorical).
Quantitative Variables:
Numerical data that you can add, subtract, multiply, and divide.
Examples: Age (years), Blood pressure (mm Hg), BMI, Pulse (beats per minute), Exercise (hours per week), Coffee drinking (ounces per day).
Quantitative Variables: Continuous vs. Discrete
Continuous: can theoretically take on any value within a given range (e.g., height=188.99955… cm).
Discrete: can only take on certain values (e.g., no. of children in a family, No. of cities).
A continuous example is the colour spectrum of a rainbow.
A discrete example is travel class: first, business, economy (strictly speaking, these are ordered categories rather than numerical values).
Qualitative Variables
Binary: Two categories.
Examples: Dead/alive, Treatment/placebo, Disease/no disease, Exposed/Unexposed, Heads/Tails, Did you have breakfast in the morning? (Yes/No).
More than Two categories
Example: Hair color – Blonde, Red-haired, Brown, and so forth.
Classification of Data (Levels of Measurement)
Four levels of measurement:
Nominal
Ordinal
Interval
Ratio
More information is conveyed as one moves from nominal to ratio.
Scales of measurement can be either qualitative or quantitative.
Nominal:
Data placed in categories (no ordering).
Cannot be quantified.
Mutually exclusive (e.g. TYPES).
Examples: blood type, type of car owned, gender, colour of paint.
Comparisons are made by counting the cases in each named category.
e.g. KIN (400 students), PSYC (200 students), BIOL (10 students).
Can ask which major has the most students (the mode), but there is no "average" major.
Blood types can't be combined or averaged; the categories can only be compared.
Ordinal:
Data is ranked.
Examples: “Idol” contest, preference (first, second, third), mineral hardness, cancer stages, University ranking, Letter-grades.
Used to arrange data in order, whereas nominal scales only categorize and compare.
Examples of orderings: small to big, high to low, close to far, poor to rich, tall to short, heavy to light.
Interval:
Equal units of measurement assigned to the attribute.
Zero point is arbitrary!
Therefore values are not proportional (ratios cannot be formed by multiplying or dividing).
e.g. temperature (°F, °C): temperature can be below 0 degrees Celsius (-10 or -20).
A value of zero does not mean "none": 0 degrees is still a temperature with a meaning.
Only addition and subtraction are meaningful.
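A small Python sketch of why ratios are not meaningful on an interval scale. The two Celsius values are illustrative assumptions. Kelvin is used as the comparison because it is a ratio scale with a true zero (offset 273.15 from Celsius).

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale temperature (Celsius) to a ratio scale (kelvin)."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0  # illustrative values

# Differences are meaningful on an interval scale: both scales agree.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are not: 20 C is not "twice as hot" as 10 C.
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"difference: {diff_c:.1f} C vs {diff_k:.1f} K")
print(f"ratio: {ratio_c:.2f} (Celsius) vs {ratio_k:.3f} (kelvin)")
```

The difference is 10 degrees on both scales, but the apparent ratio of 2 in Celsius shrinks to about 1.035 once the zero point is a true zero.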
Ratio:
Same as interval but zero is absolute or true.
Zero indicates an absence of the variable.
Therefore, direct comparison can be made.
Examples: Age, distance, weight, time, money etc.
Zero value means "zero"- nothing.
Numerical, and zero means nothing (absent): division, multiplication, addition, and subtraction are all meaningful.
Dependent and Independent Variables
Dependent Variable (DV)
The variable of primary interest (i.e. it is measured).
A variable whose changes we wish to study (a response variable).
The variable designed to measure the effect of the variation of the independent variable (outcomes).
Independent Variable (IV)
A variable we believe affects the measurements obtained on the dependent variable; i.e., it is manipulated.
A variable whose effects on the dependent variable we wish to study.
The variable that the researcher changes within a defined range, to study the effect on the dependent variable (predictors).
Variables Example: Vitamin C study
The independent variable (daily vitamin C intake) may affect the dependent variable (life span).
Scientists will manipulate the vitamin C intake in a group of 100 people: 50 people will be given a daily high dose of vitamin C and 50 people will be given a placebo pill over a period of 25 years.
The goal is to see if the independent variable of high vitamin C dosage affects the people's life span.
Experimental Control
“Ideal” Scenario:
To imply causation, the experimenter eliminates the influence of all variables that could affect the DV except the one(s) directly manipulated.
All conditions are kept the same for all participants except the effect of the IV.
“Reality” Scenario:
Impossible to control all variables that could affect the DV.
Researchers control the variables they can.
Other influences that are not controlled are assumed to be randomized (i.e., we assume the effects are “washed out” if they are “spread out” over the groups).
Uses of Statistics:
To assist in describing data.
In making inferences or generalizations from experimental data (sample) to larger groups (population).
In studying causal relationships.
Introduction to Excel
Microsoft Excel is a useful spreadsheet software.
Use it to enter all sorts of data and perform financial, mathematical or statistical calculations.
Open an Existing Excel Workbook
On the File tab, click Open.
Create a New Excel Workbook
On the File tab, click New.
Click Blank workbook.
Excel worksheet.
Analysis ToolPak
An Excel add-in program that provides data analysis tools for financial, statistical and engineering data analysis.
Analysis ToolPak add-in
On the File tab, click Options.
Under Add-ins, select Analysis ToolPak and click on the Go button.
Check Analysis ToolPak and click on OK.
On the Data tab, in the Analysis group, click Data Analysis.
The Data Analysis dialog box appears.
Select Histogram and click OK to create a Histogram in Excel
Organizing and Displaying Data
How to make sense of our data?
Good research is based on collecting large amounts of data, which needs to be simplified.
Frequency distribution – lists all possible data values or type, and the frequency of occurrence of each one.
Meant to organize and describe the data in table form.
Use frequency table to construct a frequency histogram (graph).
Reveal the pattern of the scores/observations.
Types of Frequency Distributions:
Ungrouped:
Frequency of all the possible data values or items in your dataset.
Can be nominal/ordinal categories OR quantitative but small number of single values.
Grouped (class intervals):
Applies when all “possible data values” would be too many, so data are arranged and separated into groups called class intervals.
Each class interval includes a range of data values.
Examples of Each Type:
Ungrouped:
Categorical: blood type, majors, teams.
Quantitative: number of kids in a household, number of towns/cities you have lived in, etc.
Grouped (class intervals):
Annual salary, reaction times for motor tasks, weight, commuting time to York.
Usually continuous values (which need a range), but can be discrete with many possible values (e.g. age).
Example:
Ungrouped Frequency Distribution:
Chin-up scores: 7, 15, 14, 9, 8, 13, 12, 15, 8, 12, 9, 9, 10, 13, 11, 10, 12 (N=17).
X = 15 Tally marks = II.
X = 14 Tally marks = I.
X = 13 Tally marks = II.
X = 12 Tally marks = III.
X = 11 Tally marks = I.
X = 10 Tally marks = II.
X = 9 Tally marks = III.
X = 8 Tally marks = II.
X = 7 Tally marks = I.
Example:
Ungrouped Frequency Distribution with Frequency and Cumulative Frequency.
X   Frequency   Cumulative f:
15 2 17
14 1 15
13 2 14
12 3 12
11 1 9
10 2 8
9 3 6
8 2 3
7 1 1
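The tally, frequency, and cumulative frequency columns can be reproduced with a short Python sketch. The function name is illustrative. The cumulative frequency is accumulated from the lowest score upward, and the rows are then listed highest score first to match the layout used in these notes.

```python
from collections import Counter

chin_ups = [7, 15, 14, 9, 8, 13, 12, 15, 8, 12, 9, 9, 10, 13, 11, 10, 12]

def frequency_table(scores):
    """Return (value, frequency, cumulative frequency) rows, highest value first."""
    counts = Counter(scores)
    rows = []
    running_total = 0
    for value in sorted(counts):      # accumulate from the lowest score up
        running_total += counts[value]
        rows.append((value, counts[value], running_total))
    return rows[::-1]                 # list the highest score first

table = frequency_table(chin_ups)
for value, freq, cum_freq in table:
    print(f"{value:>2} {freq:>2} {cum_freq:>3}")
```

The last cumulative value must equal N (here 17), which is a quick check that no score was missed.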
Example:
Grouped Frequency Distribution:
Chin-up scores (same scores as previous ex.):
CLASS INTERVAL FREQUENCY (f):
14-15 : 3.
12-13 : 5.
10-11 : 3.
8-9 : 5.
6-7 : 1.
Steps in Constructing a Frequency Distribution
Step 1: Count the number of scores (N = 50)
Step 2: Identify highest and lowest score
MAX = 368 and MIN = 252. Range is 116.
Step 3: Identify smallest unit of measurement
Smallest unit = 1.
Step 4: Decide on an appropriate number of class intervals (e.g., 7 intervals, giving 7 rows in the frequency distribution).
Step 5: Decide on the score range of each class interval (i): i = range / number of intervals = 116 / 7 ≈ 16.57.
Step 6: Round this class interval to make this range PRETTY
Choose 20, 15, 12 or 10 instead of 16.57 or 14.5.
Step 7: List class intervals of scores in order
Try a class interval of 15 (pretty range).
Need a starting value for first class interval that is also pretty.
Min is 252. Thus, round down to 250!
Class intervals (of 15): 355-369, 340-354, 325-339, 310-324, 295-309, 280-294, 265-279, 250-264.
Begin with the lowest class interval, which contains the smallest values. Then add i (here, 15) to get the start of each next interval, until the maximum score is covered.
Make sure that intervals have:
Same width (range of numbers)
No overlap across intervals
No gaps
*Rounding i down to 15 means more intervals are needed (8 instead of 7), and thus more rows.
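The steps above can be sketched in Python for the worked example (MIN = 252, MAX = 368, i = 15, starting value 250). The function name is illustrative, and the sketch assumes integer scores with a smallest unit of 1, so each interval's upper limit is its lower limit plus i minus 1.

```python
def class_intervals(start, maximum, width):
    """Return (lower, upper) limits of equal-width intervals covering start..maximum.

    Assumes integer scores with a smallest unit of 1, so upper = lower + width - 1.
    """
    intervals = []
    lower = start
    while lower <= maximum:
        intervals.append((lower, lower + width - 1))
        lower += width
    return intervals

score_range = 368 - 252        # step 2: range = 116
raw_width = score_range / 7    # step 5: about 16.57 before rounding to a "pretty" value
bins = class_intervals(start=250, maximum=368, width=15)

print(round(raw_width, 2))
for lower, upper in reversed(bins):   # highest interval first
    print(f"{lower}-{upper}")
```

Because each interval starts exactly one unit above the previous upper limit, the intervals have equal width, do not overlap, and leave no gaps.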
Frequency Distributions
Ungrouped distributions: use UNGROUPED
When data are items rather than numbers, i.e., nominal or ordinal (qualitative) values.
When all possible data values can be listed without there being too many (< 15), e.g. a small number of possible discrete scores (such as how many courses each student in this class is enrolled in this term).
Just list the items or values and start tallying once the number of rows needed is clear.
Grouped distributions: use GROUPED
When data values are continuous (e.g. weight, time, blood pressure) or there are too many possible data values (e.g., age or salary).
Calculate step 5, a range of values known as the class interval (i) or bin width.
A good way to start is to first estimate what would be a good number of bins (step 4), but you may need to redo steps 4 and 5 to get a “PRETTY” class interval or bin, e.g. 2, 5, 10, 12, 15, 20 etc. (or 0.01, 0.2, 0.5 etc.).
*Add this i to the start of each bin, starting at or just below the smallest score or value in the dataset. Cover all the data with bins that have: (1) the same width/range, (2) no overlap across bins, (3) no gaps.
Example for YOU TO DO: Construct a frequency distribution.
Data: RTs for participants .31, .27, .28, .29, .30, .25, .26, .27, .31, .34, .27, .28, .28, .29, .32
Steps
1. N = 15 (# of scores)
2. 0.34 – 0.25 = 0.09 (highest – lowest)
3. 0.01 (smallest unit of measurement)
4. 5 (number of categories; may change this afterwards)
5. i = 0.09/5 = 0.018 (step 2/step 4)
6. 0.02 (round to “pretty number”).
Class Interval   Tally   Frequency   Cumulative Freq
0.34-0.35        I       1           15
0.32-0.33        I       1           14
0.30-0.31        III     3           13
0.28-0.29        IIIII   5           10
0.26-0.27        IIII    4           5
0.24-0.25        I       1           1
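A Python sketch that tallies the reaction-time data into class intervals of 0.02 starting at 0.24. The function name and its parameters are illustrative. Scores are converted to whole hundredths first, an implementation choice made here so that binning uses exact integer arithmetic and avoids floating-point rounding surprises.

```python
rts = [.31, .27, .28, .29, .30, .25, .26, .27, .31, .34, .27, .28, .28, .29, .32]

def grouped_frequencies(scores, start, width, unit=0.01):
    """Tally scores into equal-width class intervals, lowest interval first.

    Returns rows of (lower, upper, frequency, cumulative frequency).
    """
    def to_units(x):
        return round(x / unit)        # e.g. 0.29 -> 29 hundredths

    start_u, width_u = to_units(start), to_units(width)
    top = max(to_units(s) for s in scores)
    counts = [0] * ((top - start_u) // width_u + 1)
    for s in scores:
        counts[(to_units(s) - start_u) // width_u] += 1

    rows, running = [], 0
    for k, freq in enumerate(counts):
        running += freq
        lower = start_u + k * width_u
        rows.append((lower * unit, (lower + width_u - 1) * unit, freq, running))
    return rows

rt_rows = grouped_frequencies(rts, start=0.24, width=0.02)
for lower, upper, freq, cum in reversed(rt_rows):   # highest interval first
    print(f"{lower:.2f}-{upper:.2f}  f={freq}  cum f={cum}")
```

The final cumulative frequency equals N = 15, confirming that every score was placed in exactly one interval.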
NOTE: Sometimes you will want to eliminate extreme scores from your data when doing steps 2-6 to determine class intervals. But after determining the class intervals, you need to add these extreme scores to the tally. All scores must be included in your distribution.
Graphs:
A pictorial representation of a frequency distribution or other data.
Helpful in understanding concepts, e.g. frequencies, and other summary data.
Bar graphs for ungrouped data.
Histograms for grouped data.
Graphs for Number of Students Enrolled showing values related to faculties and school
Graphs for Total yards offence in a session (Simple and Complex).
Histogram:
A histogram uses vertical bars to depict frequencies of an interval/ratio variable with no spaces between the bars.
Ungrouped FD is depicted by a bar graph while Grouped FD is depicted by a histogram.
Histogram Examples: Age groups and frequencies
Frequency Histogram vs Polygon (line graph)
Excel Exercise Textbook Horvath page (53-75)
Constructing Frequency Histogram using Data Analysis Tool Example: ages of students.
Enter the data in the excel sheet.
Select Data Analysis on the Data tab (Tools > Data Analysis in older versions of Excel).
Select Histogram from the Analysis Tools window.
Input data to construct histogram.
Format the histogram to add title and labels.
Measure of Central Tendency
Center of data set.
A single summary number which indicates where many of the scores lie.
Measure of Central Tendency:
Mean: Arithmetic average, \overline{X} = \frac{\sum X}{N}. For example: 1.42 + 1.97 + 1.42 + 1.50 + 1.67 = 7.98 => \overline{X} = 7.98 / 5 = 1.596.
Median: Middle value when data are ordered. To calculate the location of the median: (N + 1) / 2.
Mode: The value that occurs most often (the most common value in the dataset).
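A short Python sketch of the three measures, using the five example values from the mean entry and the standard-library `statistics` module. The (N + 1) / 2 rule for locating the median in the ordered list is a common convention and is an assumption here.

```python
import statistics

scores = [1.42, 1.97, 1.42, 1.50, 1.67]

mean = statistics.mean(scores)      # sum of the scores divided by N
median = statistics.median(scores)  # middle value of the ordered scores
mode = statistics.mode(scores)      # most frequently occurring value

# Assumed convention: position of the median in the ordered list.
median_location = (len(scores) + 1) / 2

print(f"mean={mean:.3f}, median={median}, mode={mode}, location={median_location}")
```

With an even N, the location falls halfway between two positions and `statistics.median` returns the average of the two middle values.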
Which Measure to Use?
Nominal – Mode (Can’t calculate mean or median): e.g. Law = 64; Kine = 59; Eng. = 37.
Ordinal – Median e.g. 1st, 2nd, 3rd, 4th, 5th.
Interval or Ratio – Mean and/or Median
Use the median instead of the mean if the distribution is highly skewed or has outliers (the median is not as affected).
*Excel: Calculate central tendency measures with these formulas: Mean: =AVERAGE(dataset); Median: =MEDIAN(dataset); Mode: =MODE(dataset).
*Excel output must account for:
Too many decimal places. For labs and most places, use only 1-2 more decimal places than the data provided for non-integer values. But for non-integer numerical answers on midterms, just to be safe, use all decimal places (paste the entire cell), OR: for values < 10, use 2 decimal places; for values < 1, use 4; for values < 0.01, use 5.
Graphs for Summary Data
Distribution of data, including central tendencies like mean and medians, can also be graphically depicted
Bar Graphs & Histograms
Bar graphs: x-axis represents different groups or categories (nominal), with spaces between the bars.
Histograms: x-axis plots an independent variable that is interval/ratio.
Graphs are created to compare relationships. Other graphs include dot plots and line plots.
Key parts of the distribution come back to:
Peaks: identify the peaks. They represent the most common values/bulk of the data: the mode.
Spread: how much do the data vary? Related to our next topic.
Kurtosis
The relative peaked-ness or flatness of the distribution.
It reflects whether the scores are more or less evenly distributed throughout the measurement range
Leptokurtic: the scores are bunched together with steeply sloping sides.
Platykurtic: the scores are more evenly spread out; a greater proportion of the scores fall toward the ends, or tails.
Symmetry
Normal distribution: a distribution is termed symmetrical when the data frequencies decrease at equal rates above and below a central point.
*Skewed: bunching of the observations at one or the other end of the measurement range.
Positively Skewed: observations are bunched at the lower score values.
Negatively Skewed: observations are bunched at the higher score values.
Distributions can also be bimodal in nature.
The mean is more affected than the median when the distribution is skewed.
Distributions with outliers should cause concern
Normal Distribution
Many variables* closely follow a normal distribution,
e.g. height, birth weight, errors in measurements or distance from bull’s eye, blood pressure, RT, MT, marks on a test (if not too easy or hard), IQ, GPA, shoe size, hours slept.
* For large N, e.g., N > 100
Positively Skewed: income, scores on a difficult test, numbers of kids or cars or broken bones etc, points scored in a game, variables with a lower limit like weight. For large N.
Negatively Skewed: age at death/lifespan, scores on an easy test, income as a function of age, variables with an upper limit (100%) For large N.
*Bimodal means two peaks/modes, but the peaks/modes don’t need to be equal.
Graphs and distributions can take on all sorts of shapes.
Measures of Variability
Measure of variability is a single number which describes the spread in a set of data.
Example: {7, 12, 10, 8, 13} and {0, 5, 10, 15, 20} (Mean = 10 for both sets). However, the scores are more spread out in the second set.
Most common measures:
Range
Standard Deviation (SD)
Range:
Total spread in data
e.g. 4, 5, 7, 9 (Range = 9 – 4 = 5).
Interpretation – score range over which 100% of scores fall
Advantage–very quick.
can be used for all levels of measurement.
Disadvantage– influenced by single extreme scores
With Examples Highlighting How Ranges Can Be Similar, Even if the Mean isn't.
Average Deviation
How much does each score deviate (vary) from the mean?
e.g. 14, 12, 9, 17, 8. N = 5 and Mean = 60 / 5 = 12
*Deviation of each score from the mean: 2, 0, -3, 5, -4
*NOTE: Add up the deviations = 0!
*Use absolute values (i.e. ignore the negative signs). Now divide by N, e.g. (2 + 0 + 3 + 5 + 4) / 5 = 2.8
The average deviation is rarely used. Is there another method to remove the negative signs?
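The average-deviation calculation for the five example scores can be checked with a few lines of Python. This is a sketch in which the variable names are illustrative.

```python
scores = [14, 12, 9, 17, 8]

n = len(scores)
mean = sum(scores) / n                         # 60 / 5
deviations = [x - mean for x in scores]        # signed deviations from the mean
sum_of_deviations = sum(deviations)            # signed deviations cancel out
average_deviation = sum(abs(d) for d in deviations) / n

print(mean, deviations, sum_of_deviations, average_deviation)
```

The signed deviations always sum to zero, which is why the signs must be removed (by taking absolute values, or by squaring) before averaging.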
Variance
Sum of Squares:
The sum of the squared deviations from the mean,
Always a positive value
Variance = Sum of Squares divided by n (statistically known as the variance)
*Formula: Variance = \frac{\sum (X - \overline{X})^2}{n}
Suppose these data represent age to the nearest year of eight persons. How would it be accounted for?
The Spread Depends on the Deviation!
Standard Deviation
Measure of variability for scores about the mean.
Measure of deviations of all the scores from the mean, expressed as a single number.
*Sample statistic: SD or s.
SD gives a method of specifying the percentage of scores falling within certain score limits around the mean.
Quick but crude estimate: , if there are no extreme scores
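A Python sketch of the sum of squares, variance, and standard deviation, applied to the two example sets from the start of this section (both have a mean of 10). It divides by n, as the Variance notes do. Taking the SD as the square root of the variance is the standard definition, and the function names are illustrative. Note as an assumption that many texts divide by n - 1 when estimating from a sample.

```python
import math

def sum_of_squares(scores):
    """Sum of squared deviations from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def variance(scores):
    """Sum of squares divided by n (many texts use n - 1 for a sample)."""
    return sum_of_squares(scores) / len(scores)

def standard_deviation(scores):
    """Square root of the variance, in the same units as the scores."""
    return math.sqrt(variance(scores))

first_set = [7, 12, 10, 8, 13]
second_set = [0, 5, 10, 15, 20]
for name, data in (("first", first_set), ("second", second_set)):
    print(name, sum_of_squares(data), variance(data),
          round(standard_deviation(data), 2))
```

Although the means are equal, the second set has the larger SD (about 7.07 versus about 2.28), matching the impression that its scores are more spread out.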
Formulas
Calculate with the Following Information:
Dataset A
\overline{X} = 0.50, \sum X^2 = 2.37
Dataset B
What can you say about the mean velocity and the variability (SD) in the 2 groups?
Dataset A
Dataset B
Interpretation of Mean and SD
Quick Reference for Means, Medians, and Distributions
*With extreme scores the mean is not a good measure of central tendency.
*Rule of Thumb: if mean and median differ by 1 SD or more then use median
Measures of Central Tendency and Variability Across Levels of Measurement
Level       Central tendency   Variability
Nominal     Mode               Range
Ordinal     Median             Range
Interval    Mean*              SD
Ratio       Mean*              SD
* although the median is better if the distribution is skewed
Percentiles and Z-Scores
Inter-Quartile Range (IQR):
IQR is a measure of variability, based on dividing a dataset into quartiles.
Quartiles divide a rank-ordered dataset into four equal parts.
Values that divide each part are called the first, second, and third quartiles and denoted by Q1, Q2, and Q3.
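A Python sketch of quartiles and the inter-quartile range, reusing the chin-up scores from the frequency-distribution example as illustrative data. The IQR is taken as Q3 minus Q1. `statistics.quantiles` (Python 3.8+) is used with its default "exclusive" method. Other tools, including Excel's quartile functions, may use slightly different conventions and give slightly different cut points.

```python
import statistics

chin_ups = [7, 15, 14, 9, 8, 13, 12, 15, 8, 12, 9, 9, 10, 13, 11, 10, 12]

# quantiles() sorts the data and returns the three cut points when n=4.
q1, q2, q3 = statistics.quantiles(chin_ups, n=4)
iqr = q3 - q1   # spread of the middle 50% of the scores

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

Q2 is the median, so it should agree with `statistics.median(chin_ups)`.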
Percentiles and raw scores: a raw score can be converted to a percentile, and a percentile back to a raw score.
Relative Scores: Percentiles and Z-Scores
A method of describing an individual’s standing in relation to a group, common use with norms: height/weight tables, age, fitness level, exams.
Achieved by translating an individual’s raw score into either percentile or z-score (transformation)
Raw scores refer to the original measure, e.g. height or % on an exam (which can be transformed to a letter grade).
Percentiles: a percentile score is the PERCENTAGE of people in the group who have the same raw score or a lower raw score than the one in question.
*If ranks have not been calculated, first order the numbers, then find the location (rank) of the value in question.
Equation for percentile ranking: \frac{\text{ordinal rank of a given value}}{\text{total number of values}} \times 100
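A Python sketch of the percentile-rank calculation. The ordinal rank of a value is taken to be the number of scores at or below it, following the definition of a percentile score given above. The chin-up scores from the frequency-distribution example serve as illustrative data, and the function name is an assumption.

```python
def percentile_rank(scores, value):
    """Percentage of scores that are equal to or lower than `value`."""
    at_or_below = sum(1 for s in scores if s <= value)
    return at_or_below / len(scores) * 100

chin_ups = [7, 15, 14, 9, 8, 13, 12, 15, 8, 12, 9, 9, 10, 13, 11, 10, 12]

for raw_score in (7, 12, 15):
    print(raw_score, round(percentile_rank(chin_ups, raw_score), 1))
```

Here 12 of the 17 scores are at or below 12, so a raw score of 12 has a percentile rank of about 70.6.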
Formula for Percentiles from Grouped Data: P = \frac{\frac{x - LL}{i} \times fw + \sum fb}{N} \times 100
Where:
x = the score you are converting to percentile
LL = lower limit of the class interval that contains the score
i = the size of each class interval
fw = the frequency of scores in the interval that contains the score
∑fb = sum of the frequencies of the scores below the interval
N = number of scores in the data set
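A Python sketch of a percentile estimated from grouped data, using the symbols defined above. The form used, P = (((x - LL) / i) * fw + ∑fb) / N * 100, is a common textbook version and is an assumption here. Texts also differ on whether LL is the apparent lower limit (as used below) or the real limit half a unit lower. The example reuses the chin-up scores grouped in intervals of 2.

```python
def grouped_percentile(x, lower_limit, width, f_within, f_below, n):
    """Percentile of score x estimated from a grouped frequency distribution.

    Assumed form: P = (((x - LL) / i) * fw + sum_fb) / N * 100
    """
    position_in_interval = (x - lower_limit) / width   # fraction of the way through
    return (position_in_interval * f_within + f_below) / n * 100

# Chin-up scores grouped as 6-7: 1, 8-9: 5, 10-11: 3, 12-13: 5, 14-15: 3 (N = 17).
# A score of 13 lies in the 12-13 interval (fw = 5); 1 + 5 + 3 = 9 scores lie below it.
p = grouped_percentile(x=13, lower_limit=12, width=2, f_within=5, f_below=9, n=17)
print(round(p, 1))
```

Because the formula interpolates within the interval, the result is an estimate and will generally differ a little from the percentile rank computed on the ungrouped scores.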
Finding the Raw Score of a Given Percentile -
Also possible to calculate in reverse: calculate what raw score a certain percentile represents
Formula:
All symbols as before with the addition of