Statistics Test One Notes
Basic Statistical Terms
Population
Entire group of people or things being studied
Sample
A subset (smaller portion) of the population
Variable
A characteristic or factor that can change or vary
Data
The values of variables
Can come from observations, counts, measurements, or responses
Census
Data collected from the entire population
Parameter vs. Statistic
Parameter
A numerical value that describes a population
Example: Average age of all people in the U.S.
Statistic
A numerical value that describes a sample
Example: Average age of people from a sample of three states
Key Idea
Population → parameter
Sample → statistic
Types of Studies
Observational Study
No attempt is made to control individuals or variables
Researchers only observe and record
Used when control is difficult, unethical, or impossible
Examples:
Smoking and lung cancer
Premature birth and reading skills
Public opinion surveys
Experimental Study
Researchers control one or more variables
Subjects are randomly assigned to groups
A treatment is applied
Examples:
Music type and cognitive test performance
Pesticide use and crop yields
Descriptive vs. Inferential Statistics
Descriptive Statistics
Organize, summarize, and display data
Describe what the data shows
Inferential Statistics
Use sample data to make conclusions about a population
Example conclusion: Married men tend to live longer than unmarried men
Qualitative vs. Quantitative Data
Qualitative (Categorical) Data
Descriptive, non-numerical
Examples:
Color of a car
Type of music
Quantitative Data
Numerical measurements or counts
Can be:
Discrete (countable): number of siblings
Continuous (measured): GPA, gallons of water
Methods of Data Collection
Survey
Collects data by interview, phone, mail, or internet
Example: Approval rating of the U.S. president
Observational Study
Observe and measure characteristics
Example: Children’s behavior study
Experiment
Apply a treatment and observe responses
Example: Cinnamon extract and heart disease risk
Simulation
Uses models (often computers)
Example: Crash tests using dummies
Experimental Design & Control
Control
Reduce effects of variables not being studied
Confounding Variables
When effects of different factors cannot be separated
Placebo Effect
Subject responds even though no real treatment was given
Blinding
Subject does not know if they received treatment or placebo
Double-Blind Experiment
Neither subject nor researcher knows who received treatment
Sampling Techniques
Random Sampling
Every population member has an equal chance of selection
Simple Random Sample
Every possible sample of the same size has an equal chance
Other Sampling Methods
Stratified Sample
Divide population into groups (strata) and sample each
Cluster Sample
Divide into clusters and sample entire clusters
Systematic Sample
Select every kth member after a random start
Convenience Sample
Easy to collect but often biased (not recommended)
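The sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered members, the strata, the cluster size, and every variable name are assumptions made for the example, not part of the course material.

```python
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population: member IDs 1..100

# Simple random sample: every possible group of 10 is equally likely
simple = rng.sample(population, 10)

# Systematic sample: every kth member after a random start
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: split into strata, then sample from each stratum
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [m for group in strata.values() for m in rng.sample(group, 5)]

# Cluster sample: split into clusters, then take every member of chosen clusters
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [m for cluster in rng.sample(clusters, 2) for m in cluster]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Note the difference between the last two: a stratified sample takes some members from every group, while a cluster sample takes all members from only some groups.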
Experimental Designs
Completely Randomized Design
Subjects randomly assigned to treatment groups
Randomized Block Design
Subjects grouped by similar traits, then randomized
Matched-Pairs Design
Similar subjects paired; each receives different treatment
Sample Size & Replication
Sample Size
Number of subjects in a study
Larger samples → more reliable results
Replication
Repeating an experiment with many subjects
Misleading Data
Data can be misleading when the sample is biased
Bias reduces the accuracy of results
Bias in Studies
Biases disrupt the validity of conclusions
Results from biased samples are not accurate or reliable
Evaluating a Study
Ask: Is the study impartial?
Determine whether the sample fairly represents the population
Example of Bias
Surveying only college students to estimate opinions of all adults
This creates sampling bias because not all age groups are represented
Key Idea
Biased samples → inaccurate and invalid results
Outlier
A value that is much higher or much lower than the rest of the data
Does not follow the overall pattern of the data set
Why Outliers Matter
Can skew results, especially the mean (average)
May make data misleading
Can affect conclusions
Causes of Outliers
Measurement or recording errors
Unusual but valid values
Data entry mistakes
Key Idea
Outliers should be investigated, not automatically removed
Qualitative Data
Data that describes or categorizes attributes of a population
Usually expressed using words or letters
Also called Categorical data
Phrases like which type or what kind indicate qualitative data.
Quantitative Data
Data that results from counting or measuring
Always expressed using numbers
Represents numerical values of attributes
Phrases like how many or the number of indicate that data is quantitative.
Discrete Data (type of quantitative data)
Countable numbers
No fractions or decimals
Continuous Data (type of quantitative data)
Data that can take any value in a range
can include fractions and decimals
Continuous data is defined as the type of quantitative data that is the result of measuring
The purpose of an experiment is to investigate the relationship between variables
Explanatory Variable
Variable that explains or influences changes in another variable
Represents the cause
Also called:
Independent variable
Input variable
Predictor variable
Response Variable
Variable that is affected by changes in the explanatory variable
Represents the effect
Also called:
Dependent variable
Outcome variable
Output variable
NOTE: An explanatory variable is defined as the independent variable in an experiment. The value or component of the independent variable applied in an experiment is called the treatment
2.3 Stem and Leaf Plot
A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population
A stem-and-leaf plot is a good choice when the data sets are small
Each data value is separated into a “stem” and a “leaf” using its digits
The “leaf” consists of the final significant digit. The “stem” contains the remaining digits in front of the “leaf”
Stem-and-leaf plot: a graph, especially good for small sets of data, that separates each data point into a leaf (the last significant digit) and a stem (any digits to the left of that digit). The stems can be arranged in ascending or descending order
A Stem-and-leaf plot is also commonly referred to as a Stemplot
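As a rough sketch of how the stem/leaf split works, the Python below builds a text stemplot. It assumes non-negative whole numbers, so the leaf is the ones digit and the stem is everything in front of it. The function name and the data are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return a text stemplot for non-negative integers (leaf = ones digit)."""
    stems = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 23 -> stem 2, leaf 3
        stems[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, even ones with no leaves
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return "\n".join(lines)

print(stem_and_leaf([12, 15, 21, 23, 23, 30, 47]))  # hypothetical data
# 1 | 2 5
# 2 | 1 3 3
# 3 | 0
# 4 | 7
```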
Create and interpret Dot Plots
Dot plots are graphs used to display the distribution of values in a data set.
How to create Dot plots
Find the minimum and maximum of the data set
Create a horizontal line labeled with the values between the minimum and maximum
Draw a dot above the appropriate value for each number in the data set. Stack the dots vertically as needed
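The three steps above can be mimicked with a small text-based sketch in Python. It assumes the data are small non-negative whole numbers, and the function name and data set are invented for the example. A real dot plot would normally be drawn by hand or with graphing software.

```python
from collections import Counter

def dot_plot(data):
    """Return a text dot plot for a list of small non-negative integers."""
    counts = Counter(data)
    # Steps 1 and 2: the axis runs from the minimum to the maximum
    axis = range(min(data), max(data) + 1)
    width = len(str(max(data))) + 1
    rows = []
    # Step 3: stack one "*" per occurrence above each value, top row first
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[v] >= level else "").rjust(width) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)

print(dot_plot([1, 2, 2, 3, 3, 3, 5]))  # hypothetical data
```

For this data the tallest stack sits above 3, so the plot is unimodal with its mode at 3.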
Characteristics of Dot Plots:
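Uniform - dots are stacked to about the same height all the way across
Unimodal - only one mode (the value with the most dots)
Multimodal - more than one mode
Bimodal - two modes
Symmetric - it looks the same on the left as it does on the right
Skewed left - long tail of dots to the left; most dots pile up on the right
Skewed right - long tail of dots to the right; most dots pile up on the left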
Bar Graphs
Bar graph: a graph to summarize and organize categorical data consisting of rectangular bars that are separated from each other and the length (or height) of the bar for each category is proportional to the number or percent of individuals in each category
A bar graph may also be referred to as a bar chart
Line Graphs
Line graphs are most appropriate for showing how a quantity changes over time. This is known as time series data. Using a line graph in other situations can be misleading.
Line graph: a bar graph with the tops of the bars represented by points joined by lines (with the rest of each bar not shown). It is only appropriate for ordered (rather than qualitative) variables, typically to show how a quantity changes over time
A Line graph may also be referred to as a Line chart or as a Time series graph
2.6.1 Using measures of Central Tendency Videos
Determine the mean of the test scores
The mean is the average. To determine the mean, add the numbers and then divide by the number of data items
The downside of the mean is that it is sensitive to outliers, which are data values that fall outside the primary grouping of the data.
Finding the Mean from Frequency tables
Given the frequency table below, which equation shows the mean of the set of data?
Data | Frequency |
1 | 15 |
3 | 5 |
7 | 10 |
10 | 2 |
To find the mean from a frequency table, multiply each data value by its frequency. Then add the individual products. 1(15)+3(5)+7(10)+10(2) = 120
Take this sum and divide it by the number of data values, which can be found by adding the numbers in the frequency column. 15+5+10+2 = 32
120 divided by 32 is 3.75. This is the mean of the data from the frequency table.
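The same calculation can be checked with a short Python sketch. The dictionary layout and the function name are choices made for this example only.

```python
# Frequency table from the notes: data value -> frequency
freq_table = {1: 15, 3: 5, 7: 10, 10: 2}

def mean_from_frequency_table(table):
    """Return the mean of data summarized as {value: frequency}."""
    # Multiply each value by its frequency and add the products (120 here)
    weighted_sum = sum(value * freq for value, freq in table.items())
    # The number of data values is the sum of the frequency column (32 here)
    count = sum(table.values())
    return weighted_sum / count

print(mean_from_frequency_table(freq_table))  # 3.75
```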
Estimating the mean from a Grouped frequency table
Grouped Frequency table
Supplies data values in intervals (or groups)
A regular frequency table gives single data values, so it gives us more information
Having grouped data intervals means we are able to estimate the mean, but probably not find an exact value
To estimate the mean, we have to use the midpoints of the intervals
Estimated mean = sum of (midpoint × frequency) divided by the number of data values
Find the midpoints of all intervals
Multiply the midpoints by their frequencies
Add the products together
Divide by the number of data values
Question
Given the frequency table below, what is the estimated mean?
Data Intervals | Frequency |
1-4 | 3 |
5-8 | 5 |
9-12 | 2 |
13-17 | 1 |
To find the sum, multiply each interval's midpoint by its corresponding frequency and then add those products together. The sum of the midpoints multiplied by their frequencies is:
2.5(3) + 6.5(5) + 10.5(2) + 15(1) = 76
To find the number of data values, add the frequencies of the data values: 3+5+2+1=11.
In order to find the mean, divide 76 by 11 to get 6.91, which is the estimated mean of the data.
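A minimal Python sketch of the midpoint method, using the grouped table from the question. The list-of-tuples layout and the function name are assumptions made for illustration.

```python
# Grouped frequency table from the notes: ((low, high), frequency)
grouped_table = [((1, 4), 3), ((5, 8), 5), ((9, 12), 2), ((13, 17), 1)]

def estimated_mean(table):
    """Estimate the mean of grouped data using interval midpoints."""
    total = 0.0
    count = 0
    for (low, high), freq in table:
        midpoint = (low + high) / 2  # e.g. (1 + 4) / 2 = 2.5
        total += midpoint * freq     # midpoint times its frequency
        count += freq                # running number of data values
    return total / count

print(round(estimated_mean(grouped_table), 2))  # 6.91
```

The result is only an estimate because the midpoint stands in for every value in its interval.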
Find the Median of a data set
The median is the number in the middle when the data is ordered from least to greatest. If there are two middle values, find the mean of the two numbers
Order the values from least to greatest
If there are two middle values, find the mean of both of them
Find the Mode of a Data set
The mode is the number or the numbers that occur the most
It is possible to have more than one mode if they occur the same amount of times
Sorting the values from least to greatest makes the mode easier to find
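For a quick check of a median or mode, Python's built-in statistics module can be used. The data set below is hypothetical and was chosen so that there are two middle values and two modes.

```python
from statistics import median, multimode

scores = [8, 5, 12, 5, 7, 10, 8, 3]  # hypothetical data

ordered = sorted(scores)    # [3, 5, 5, 7, 8, 8, 10, 12]
middle = median(scores)     # two middle values (7 and 8), so their mean: 7.5
modes = multimode(ordered)  # 5 and 8 each occur twice: [5, 8]

print(ordered, middle, modes)
```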
What to Report When There is an Outlier
It's best to report the median
Mean: Uses all data, but sensitive to outliers
Mode: Easily affected by small changes in frequency
Median: Does not use all data, but is robust
2.6.2 Quartiles and Box Plots
Summary - beginning, middle and end of a set of data
Five number summary components
Sample minimum
First Quartile (Q1)
Second Quartile (median)
Third Quartile (Q3)
Sample Maximum (largest value)
Finding the Five Number summary
1. Sort the # from least to greatest
2. Identify the minimum and the maximum
3. Find the median
4. Find the median of the lower half of the data (Q1)
5. Find the median of the upper half of the data (Q3)
Ex. Given the following list of test scores, find the five number summary:
86, 92, 85, 82, 83, 81, 80, 89, 77, 81, 82, 86, 78, 75, 93
(1) Sort them
75, 77, 78, 80, 81, 81, 82, 82, 83, 85, 86, 86, 89, 92, 93
(2) Find minimum and Maximum
75 and 93
(3) Find the median
82 because it's the middle number
(4) Find Q1 (the median of the lower half of the data, between the minimum and the median)
80
(5) Find Q3 (the median of the upper half of the data, between the median and the maximum)
86
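The five steps can be sketched in Python as below. The sketch uses the sorted list from step (1) and follows the median-of-each-half method described in these notes. When the number of values is odd, it leaves the median out of both halves, which is an assumption that matches the worked answer. Other textbooks and software may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]  # values below the median position
    # For an odd count, skip the median itself when taking the upper half
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])

test_scores = [75, 77, 78, 80, 81, 81, 82, 82, 83, 85, 86, 86, 89, 92, 93]
print(five_number_summary(test_scores))  # (75, 80, 82, 86, 93)
```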
How to find the kth Percentile
Percentiles divide ordered data into hundredths
Common measure of location of data values within a data set
Mostly used with very large populations
K represents any number for the percentile
ex. What data value is the 15th percentile - the K value would be 15
How to find the kth percentile
1. Order the data from least to greatest
2. Assign values to the following variables: k = the percentile, n = the total number of data values in the data set
3. Calculate i, the index (or the position) of a data value: $i = \frac{k}{100}(n + 1)$
4. Use i to determine the data value at that position
ex. Given this data set, find the 68th percentile:
12, 15, 2, 35, 34, 39, 40, 22, 25
Finding the 68th percentile means finding the value such that 68% of the data values are less than or equal to it
(1) Order the data from Least to greatest
2, 12, 15, 22, 25, 34, 35, 39, 40
(2) Assign values to the following variables: k = the percentile, n = the total number of data values in the data set
k = 68, n = 9 (there is a total of 9 data values)
(3) Calculate i, the index (or the position) of a data value (substitute k and n)
$i = \frac{k}{100}(n + 1) = \frac{68}{100}(9 + 1) = 6.8$ ← the index/position of the data value
(4) Use i to determine the data value at that position (find position 6.8 in the data set; i will be either a whole number or a decimal)
Since i is not an integer in this example, round i both up and down to the nearest integers. Find the data values at these positions and average them
6.8 rounds down to 6 and up to 7, so we find the data values at the 6th and 7th positions.
34 and 35 are the 6th and 7th data values, so average them.
The answer is 34.5 ← the 68th percentile of this data
Otherwise, if i were an integer (whole number), you would just count that many positions to find it. Ex. if i were 4, the percentile would be the 4th data value
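Here is a hedged Python sketch of the procedure used in this example, where the index works out to i = (k/100)(n + 1). The function name is invented, and other textbooks and software use different percentile rules, so results from a calculator or spreadsheet may differ slightly.

```python
import math

def kth_percentile(data, k):
    """Find the kth percentile using the index i = (k / 100) * (n + 1)."""
    values = sorted(data)        # step 1: order the data
    n = len(values)              # step 2: k is given, n is the count
    i = k * (n + 1) / 100        # step 3: index (position), counted from 1
    if i < 1 or i > n:
        raise ValueError("index falls outside the data set")
    if i == int(i):              # step 4a: whole number, take that position
        return values[int(i) - 1]
    below = values[math.floor(i) - 1]  # step 4b: round down and up,
    above = values[math.ceil(i) - 1]   # then average the two values
    return (below + above) / 2

data = [12, 15, 2, 35, 34, 39, 40, 22, 25]
print(kth_percentile(data, 68))  # 34.5
```

The `- 1` adjustments are there because Python lists count positions from 0, while the notes count from 1.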
What is a Quartile
Quartiles are percentiles
Percentiles divide ordered data into hundredths
The quartiles are the 25th percentile (Q1)
50th percentile (Q2) or median
75th percentile (Q3)
How to find Quartiles
a. You can use either the percentile calculation or the median method to find the first and third quartiles
b. With the percentile calculation, just replace k with 25, 50, or 75
How to find Quartiles (with the median method)
Order the data from least to greatest
Find the median
use the lower half of the data to find the Q1 (average the two numbers if there is no middle number)
Use the upper half of the data to find the Q3 (same thing)
Interquartile Range (Measuring Spread of Data)
The interquartile range (IQR) is Q3 - Q1
How to find the IQR
Find the Quartiles
Subtract the Q1 value from the Q3 value
Identify outliers in a set of data
An outlier is an extremely high or extremely low value in our data. A value is an outlier if it is greater than Q3 + 1.5(IQR) or less than Q1 - 1.5(IQR)
How to find outliers
Order the data values from least to greatest
Find the Quartiles
Find the interquartile range
Calculate the fences: Q3 + 1.5(IQR) (upper fence) and Q1 - 1.5(IQR) (lower fence)
Values larger than the upper fence are outliers
Values smaller than the lower fence are outliers
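The outlier steps can be sketched in Python as follows. The data set is hypothetical (it includes one deliberately extreme value, 95), the function names are invented, and the quartiles are found with the median method from these notes.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) using the median-of-each-half method."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[half + 1:] if len(values) % 2 else values[half:]
    return median(lower), median(upper)

def find_outliers(data):
    """Return (lower fence, upper fence, list of outliers)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1                   # interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

values = [2, 12, 15, 22, 25, 34, 35, 39, 40, 95]  # hypothetical data
print(find_outliers(values))  # (-21.0, 75.0, [95])
```

Here Q1 = 15 and Q3 = 39, so the IQR is 24 and the fences sit 36 units beyond each quartile.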
Box-And-Whisker Plot
Summarizes a set of numerical data based on five key values (the five number summary)
Minimum: the data point with the least value
Q1: the least value greater than 25% of the data points
Median: the middle data value
Q3: the least value greater than 75% of the data points
Maximum: the data point with the greatest value
Use this five # summary to construct a box-and-whisker plot
5, 9, 17, 22, 34
In a five # summary, numbers are arranged in the order of minimum, Q1, median, Q3, and maximum
DESMOS KEY
Number of data values: L [ ]
Sum of the data values: Total []
Mean of the data
Median of the data
minimum of the data
maximum of the data
Q1 of the data: quartile (L, 1)
Q3 of the data: quartile (L,3)
Sample standard deviation of the data
population standard deviation of the data
2.6.3 - skewness and standard deviation
Skewed data
Data that are mostly clumped in one area but have a few values that are much larger or much smaller
Skew to the right: data has a long tail to the right
Skew to the left: data has a long tail to the left
Symmetrical data
Standard deviation
Standard deviation is a measure of variation based on measuring how far each data value deviates, or is different, from the mean. A few important characteristics:
Standard deviation is never negative. It is zero if all data values are equal, and it gets larger as the data spread out.
Standard deviation has the same units as the original data
Standard deviation, like the mean, can be highly influenced by outliers
How to find standard deviation:
Find the mean (average) of the sample data.
Subtract the mean from each data value to find how far each value is from the average.
Square each of those differences so they are all positive.
Add up all the squared differences.
Divide that total by one less than the number of data values (that means divide by n − 1).
Take the square root of the result
Round to nearest two decimals
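Standard Deviation (Sample) | Standard Deviation (Population) |
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ | $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ |
$s = \sqrt{s^2}$ | $\sigma = \sqrt{\sigma^2}$ |
where $\bar{x}$ is the sample mean and $n$ is the sample size | where $\mu$ is the population mean and $N$ is the population size |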
If the data values represent data collected from a subset of the population, then the sample standard deviation should be used.
If the data values represent data collected from the entire population of interest, then the population standard deviation should be used.
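The steps for finding standard deviation can be followed line by line in Python. This is a sketch with a hypothetical data set. The only difference between the sample and population versions is the divisor (n − 1 versus N). The function name and the `sample` flag are choices made for this example.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

def std_dev(values, sample=True):
    """Standard deviation; divides by n - 1 for a sample, n for a population."""
    n = len(values)
    mean = sum(values) / n                             # find the mean
    squared_diffs = [(x - mean) ** 2 for x in values]  # subtract, then square
    divisor = n - 1 if sample else n                   # sample vs. population
    variance = sum(squared_diffs) / divisor            # add up, then divide
    return math.sqrt(variance)                         # take the square root

print(round(std_dev(data, sample=True), 2))   # 2.14
print(round(std_dev(data, sample=False), 2))  # 2.0

# The standard library gives the same answers
print(round(statistics.stdev(data), 2), round(statistics.pstdev(data), 2))
```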
Z-scores
Used to compare scores from different distributions
Values are interpreted in terms of the number of standard deviations above or below the mean
Positive z-score: The value is above the mean
Negative z-score: Value is below the mean
The formula for calculating z-scores is identical when working with a sample or the population
$z = \frac{x - \mu}{\sigma}$ (population) or $z = \frac{x - \bar{x}}{s}$ (sample)
Sample variance = var in Desmos
Sample standard deviation = stdev
Population standard deviation = stdevp
Population variance = varp
Entire group = Population
Part of the group = Sample
When comparing two negative z-scores, the one closer to zero is “higher” (less far below the mean)
The one farther from zero is “lower” (farther below the mean)
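A small Python sketch of comparing scores from different distributions with z-scores, where z is the value minus the mean, divided by the standard deviation. All of the scores, means, and standard deviations below are made-up numbers for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Hypothetical: two tests with different means and standard deviations
math_z = z_score(85, mean=75, std_dev=5)      # 2.0 -> two SDs above the mean
reading_z = z_score(90, mean=80, std_dev=10)  # 1.0 -> one SD above the mean
print(math_z > reading_z)  # True: the math score is relatively better

# Two scores below their means: the z-score closer to zero is the higher one
print(z_score(70, 75, 5), z_score(60, 80, 10))  # -1.0 -2.0
```

Even though 90 is the larger raw score, 85 is the stronger result relative to its own distribution.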
FOR NOTECARD
Need all the symbols for population and sample
How to find range, IQR, Outliers, and Z scores, Kth percentile
How to find the lower bound/fence and upper bound/fence
“What is the frequency of” - count how many times it occurs
Relative frequency = frequency ÷ total number of observations (gives a proportion or percentage).
Grouped frequency and estimating group frequency
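For the frequency and relative frequency reminders above, a minimal Python sketch is shown here. The survey responses are hypothetical, and the variable names are just for this example.

```python
from collections import Counter

responses = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical data

# Frequency: count how many times each value occurs
frequency = Counter(responses)

# Relative frequency: frequency divided by the total number of observations
total = len(responses)
relative_frequency = {category: count / total for category, count in frequency.items()}

print(frequency["red"])           # 3
print(relative_frequency["red"])  # 0.5
```

The relative frequencies always add up to 1 (or 100%).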