Statistics Test One Notes

Basic Statistical Terms

Population
- Entire group of people or things being studied
Sample
- A subset (smaller portion) of the population
Variable
- A characteristic or factor that can change or vary
Data
- The values of variables
- Can come from observations, counts, measurements, or responses
Census
- Data collected from the entire population

Parameter vs. Statistic

Parameter
- A numerical value that describes a population
- Example: Average age of all people in the U.S.
Statistic
- A numerical value that describes a sample
- Example: Average age of people from a sample of three states
Key Idea
- Population → parameter
- Sample → statistic

Types of Studies

Observational Study

No attempt is made to control individuals or variables
Researchers only observe and record
Used when control is difficult, unethical, or impossible
Examples:
- Smoking and lung cancer
- Premature birth and reading skills
- Public opinion surveys

Experimental Study

Researchers control one or more variables
Subjects are randomly assigned to groups
A treatment is applied
Examples:
- Music type and cognitive test performance
- Pesticide use and crop yields

Descriptive vs. Inferential Statistics

Descriptive Statistics
- Organize, summarize, and display data
- Describe what the data shows
Inferential Statistics
- Use sample data to make conclusions about a population
- Example conclusion: Married men tend to live longer than unmarried men

Qualitative vs. Quantitative Data

Qualitative (Categorical) Data
- Descriptive, non-numerical
- Examples:
  - Color of a car
  - Type of music
Quantitative Data
- Numerical measurements or counts
- Can be:
  - Discrete (countable): number of siblings
  - Continuous (measured): GPA, gallons of water

Methods of Data Collection

Survey
- Collects data by interview, phone, mail, or internet
- Example: Approval rating of the U.S. president
Observational Study
- Observe and measure characteristics
- Example: Children’s behavior study
Experiment
- Apply a treatment and observe responses
- Example: Cinnamon extract and heart disease risk
Simulation
- Uses models (often computers)
- Example: Crash tests using dummies

Experimental Design & Control

Control
- Reduce effects of variables not being studied
Confounding Variables
- When effects of different factors cannot be separated
Placebo Effect
- Subject responds even though no real treatment was given
Blinding
- Subject does not know if they received treatment or placebo
Double-Blind Experiment
- Neither subject nor researcher knows who received treatment

Sampling Techniques

Random Sampling

Every population member has an equal chance of selection
Simple Random Sample
- Every possible sample of the same size has an equal chance

Other Sampling Methods

Stratified Sample
- Divide population into groups (strata) and sample each
Cluster Sample
- Divide into clusters and sample entire clusters
Systematic Sample
- Select every kth member after a random start
Convenience Sample
- Easy to collect but often biased (not recommended)
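
A minimal Python sketch of the sampling methods above, using only the standard library. The population of 20 numbered members, the two groups, and the function names are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    # Every possible sample of size n is equally likely.
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    # Random start among the first k members, then every kth member.
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # Draw a simple random sample from each stratum.
    return [m for group in strata for m in rng.sample(group, n_per_stratum)]

def cluster_sample(clusters, n_clusters, rng):
    # Randomly choose whole clusters and keep every member of each one.
    chosen = rng.sample(clusters, n_clusters)
    return [m for cluster in chosen for m in cluster]

rng = random.Random(0)                           # fixed seed so runs repeat
population = list(range(1, 21))                  # hypothetical members 1 to 20
groups = [population[0:10], population[10:20]]   # hypothetical strata/clusters

print(simple_random_sample(population, 5, rng))
print(systematic_sample(population, 4, rng))
print(stratified_sample(groups, 2, rng))
print(cluster_sample(groups, 1, rng))
```

The stratified sample takes some members from every group, while the cluster sample takes every member from the chosen groups only.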

Experimental Designs

Completely Randomized Design
- Subjects randomly assigned to treatment groups
Randomized Block Design
- Subjects grouped by similar traits, then randomized
Matched-Pairs Design
- Similar subjects paired; each receives different treatment

Sample Size & Replication

Sample Size
- Number of subjects in a study
- Larger samples → more reliable results
Replication
- Repeating an experiment with many subjects

Misleading Data
- Data can be misleading when the sample is biased
- Bias reduces the accuracy of results
Bias in Studies
- Biases disrupt the validity of conclusions
- Results from biased samples are not accurate or reliable
Evaluating a Study
- Ask: Is the study impartial?
- Determine whether the sample fairly represents the population
Example of Bias
- Surveying only college students to estimate opinions of all adults
- This creates sampling bias because not all age groups are represented
Key Idea
- Biased samples → inaccurate and invalid results

Outlier
- A value that is much higher or much lower than the rest of the data
- Does not follow the overall pattern of the data set
Why Outliers Matter
- Can skew results, especially the mean (average)
- May make data misleading
- Can affect conclusions
Causes of Outliers
- Measurement or recording errors
- Unusual but valid values
- Data entry mistakes
Key Idea
- Outliers should be investigated, not automatically removed

Qualitative Data
- Data that describes or categorizes attributes of a population
- Usually expressed using words or letters
- Also called Categorical data
- Phrases like which type or what kind indicate qualitative data.
Quantitative Data
- Data that results from counting or measuring
- Always expressed using numbers
- Represents numerical values of attributes
- Phrases like how many or the number of indicate that data is quantitative.
Discrete Data (type of quantitative data)
- Countable numbers
- No fractions or decimals
Continuous Data (type of quantitative data)
- Data that can take any value in a range
- can include fractions and decimals
- Continuous data is defined as the type of quantitative data that is the result of measuring

The purpose of an experiment is to investigate the relationship between variables

Explanatory Variable
- Variable that explains or influences changes in another variable
- Represents the cause
- Also called:
  - Independent variable
  - Input variable
  - Predictor variable

Response Variable
- Variable that is affected by changes in the explanatory variable
- Represents the effect
- Also called:
  - Dependent variable
  - Outcome variable
  - Output variable

NOTE: An explanatory variable is defined as the independent variable in an experiment. The value or component of the independent variable applied in an experiment is called the treatment

2.3 Stem and Leaf Plot

A Statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population
A good choice when the data sets are small
Each data value will be separated into a “stem” and a “leaf” using its digits
The “leaf” consists of a final significant digit. The stem contains the remaining digits in front of the “leaf”
Stem-and-leaf plot: a graph, especially good for small sets of data, that separates data points into a leaf consisting of the last significant digit and a stem that consists of any numbers to the left of that digit and can be arranged in ascending or descending order
A Stem-and-leaf plot is also commonly referred to as a Stemplot
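
A short Python sketch of how a stem-and-leaf plot can be built. It assumes non-negative whole numbers, so the leaf is the ones digit and the stem is everything in front of it. The data values and the function name are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    # Stem = all digits except the last, leaf = the last digit.
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

data = [23, 12, 34, 15, 21, 30, 23]   # hypothetical data set
plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Each printed row is one stem followed by its leaves in ascending order.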

Create and interpret Dot Plots

Dot plots are graphs used to display the distribution of values in a data set.

How to create Dot plots

Find the minimum and maximum of the data set
Create a horizontal line labeled with the values between the minimum and maximum
Draw a dot above the appropriate value for each number in the data set. Stack the dots vertically as needed
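
The steps above can be sketched as a text-only dot plot in Python. This is only an illustration: it assumes non-negative whole-number data, and the data set and function name are hypothetical.

```python
from collections import Counter

def dot_plot(values):
    counts = Counter(values)                 # how many dots each value gets
    lo, hi = min(values), max(values)        # step 1: minimum and maximum
    axis = list(range(lo, hi + 1))           # step 2: the labeled number line
    width = len(str(hi)) + 1
    rows = []
    # step 3: stack the dots, printing the tallest level first
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[x] >= level else " ").rjust(width) for x in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(x).rjust(width) for x in axis))
    return "\n".join(rows)

print(dot_plot([1, 2, 2, 3, 3, 3, 4, 5, 5]))
```

The tallest stack of stars marks the mode.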

Characteristics of Dot Plots:

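Uniform - dots are spread evenly across the values
Unimodal - only one mode (the value with the most dots)
Multimodal - more than one mode
Bimodal - two modes
Symmetric - it looks the same on the left as it does on the right
Skewed left - most dots are piled up on the right, with a tail stretching to the left
Skewed right - most dots are piled up on the left, with a tail stretching to the right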

Bar Graphs

Bar graph: a graph to summarize and organize categorical data consisting of rectangular bars that are separated from each other and the length (or height) of the bar for each category is proportional to the number or percent of individuals in each category
A Bar graph may also be referred to as a Bar chart
Line graphs are most appropriate for showing how a quantity changes over time. This is known as time series data. Using a line graph in other situations can be misleading.
Line graph: a bar graph with the tops of the bars represented by points joined by lines (with rest of the bar not shown) and are only appropriate for ordered (rather than qualitative) variables that show how a quantity changes over time
A Line graph may also be referred to as a Line chart or as a Time series graph

2.6.1 Using measures of Central Tendency Videos

Determine the mean of the test scores

The mean is the average. To determine the mean, add the numbers and then divide by the number of data items
The downside of the mean is that it is sensitive to outliers, which are data values that fall outside the primary grouping of the data.

Finding the Mean from Frequency tables

Given the frequency table below, which equation shows the mean of the set of data?

Data	Frequency
1	15
3	5
7	10
10	2

To find the mean from a frequency table, multiply each data value by its frequency. Then add the individual products. 1(15)+3(5)+7(10)+10(2) = 120

Take this sum and divide it by the number of data values, which can be found by adding the numbers in the frequency column. 15+5+10+2 = 32

120 divided by 32 is 3.75. This is the mean of the data from the frequency table.
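
The same calculation as a short Python sketch. The dictionary layout (data value mapped to frequency) is just one convenient way to hold the table.

```python
table = {1: 15, 3: 5, 7: 10, 10: 2}   # data value -> frequency

total = sum(value * freq for value, freq in table.items())   # sum of the products
count = sum(table.values())                                  # number of data values
mean = total / count

print(total, count, mean)   # 120 32 3.75
```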

Estimating the mean from a Grouped frequency table

Grouped Frequency table

Supplies data values in intervals (or groups)
Regular frequency table gives single data values and gives us more information
Having group data intervals means we are able to estimate the mean, but probably not find an exact value

To find the mean, we have to use Midpoints

Estimated mean = Sum of (midpoint × frequency) divided by the number of data values

Find the Midpoints of all intervals
Multiply the Midpoints by their frequencies
Add the products together
Divide by the number of data values

Question

Given the frequency table below, what is the estimated mean?

Data Intervals	Frequency
1-4	3
5-8	5
9-12	2
13-17	1

To find the sum, multiply each midpoint by its corresponding frequency and then add those products together. The sum of the midpoints multiplied by their frequencies is:

2.5(3) + 6.5(5) + 10.5(2) + 15(1) = 76

To find the number of data values, add the frequencies of the data values: 3+5+2+1=11.

In order to find the mean, divide 76 by 11 to get 6.91, which is the estimated mean of the data.
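
A short Python sketch of the same estimate. Storing each interval as a (low, high) pair next to its frequency is an assumption made for this example.

```python
# ((interval low, interval high), frequency)
intervals = [((1, 4), 3), ((5, 8), 5), ((9, 12), 2), ((13, 17), 1)]

# midpoint of each interval times its frequency, added together
weighted_sum = sum((low + high) / 2 * freq for (low, high), freq in intervals)
count = sum(freq for _, freq in intervals)
estimated_mean = weighted_sum / count

print(weighted_sum, count, round(estimated_mean, 2))   # 76.0 11 6.91
```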

Find the Median of a data set

The median is the number in the middle when the data is ordered from least to greatest. If there are two middle values, find the mean of the two numbers

Order the values from least to greatest
If there are two middle values, find the mean of both of them
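
The two steps above, sketched in Python. The function name and the sample lists are made up for illustration.

```python
def median(values):
    ordered = sorted(values)          # order from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # one middle value
    # two middle values: take their mean
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 3, 9, 1, 5]))   # 5
print(median([7, 3, 9, 1]))      # 5.0
```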

Find the Mode of a Data set

The mode is the number or the numbers that occur the most

It is possible to have more than one mode if they occur the same amount of times
Sorting the values from least to greatest makes the mode easier to find
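
A quick Python check of the mode, using `multimode` from the standard `statistics` module (Python 3.8 or later), which returns every value tied for the highest count. The lists are hypothetical.

```python
from statistics import multimode

print(multimode([2, 4, 4, 7, 7, 9]))   # [4, 7]  two modes
print(multimode([1, 1, 2, 3]))         # [1]     one mode
```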

What to Report When There is an Outlier

It's best to report the median

Mean: Uses all data, but sensitive to outliers
Mode: Easily affected by small changes in frequency
Median: Does not use all data, but is robust
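
A small Python illustration of why the median is the safer choice when there is an outlier. The test scores are hypothetical.

```python
from statistics import mean, median

scores = [80, 82, 85, 88, 90]      # hypothetical test scores
with_outlier = scores + [20]       # add one unusually low score

print(mean(scores), median(scores))                         # 85 85
print(round(mean(with_outlier), 2), median(with_outlier))   # 74.17 83.5
```

One low score pulls the mean down by more than ten points, while the median barely moves.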

2.6.2 Quartiles and Box Plots

Summary - beginning, middle and end of a set of data
Five number summary components
- Sample minimum
- First Quartile (Q1)
- Second Quartile (median)
- Third Quartile (Q3)
- Sample Maximum (largest value)
Finding the Five Number summary
- 1. Sort the # from least to greatest
- 2. Identify the minimum and the maximum
- 3. Find the median
- 4. Find the median of the lower half of the data (Q1)
- 5. Find the median of the upper half of the data (Q3)

Ex. Given the following list of test scores, find the five number summary:

96, 92, 85, 82, 83, 81, 80, 89, 77, 81, 82, 86, 78, 75, 93

(1) Sort them

75, 77, 78, 80, 81, 81, 82, 82, 83, 85, 86, 89, 92, 93, 96

(2) Find minimum and Maximum

75 and 96

(3) Find the median

82 because it's the middle number (the 8th of 15 values)

(4) Find Q1 (the median of the lower half of the data, between the minimum and the median)

80, the middle of 75, 77, 78, 80, 81, 81, 82

(5) Find Q3 (the median of the upper half of the data, between the median and the maximum)

89, the middle of 83, 85, 86, 89, 92, 93, 96
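
A Python sketch of the five steps, applied to the test scores listed in the example. It assumes the median-of-halves method and leaves the median itself out of both halves when the number of values is odd. Other quartile conventions exist and can give slightly different answers. The function names are made up.

```python
def median(ordered):
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    ordered = sorted(values)               # step 1
    n = len(ordered)
    lower = ordered[: n // 2]              # values below the median position
    upper = ordered[(n + 1) // 2 :]        # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

scores = [96, 92, 85, 82, 83, 81, 80, 89, 77, 81, 82, 86, 78, 75, 93]
print(five_number_summary(scores))   # (75, 80, 82, 89, 96)
```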

How to find the kth Percentile

Percentiles divide ordered data into hundredths
Common measure of location of data values within a data set
Mostly used with very large populations
K represents any number for the percentile
- ex. What data value is the 15th percentile - the K value would be 15

How to find the kth percentile
- 1. Order the data from least to greatest
- 2. Assign values to the following variables: k = the percentile, n = the total number of data values in the data set
- 3. calculate i, the index (or the position) of a data value
- 4. Use i to determine the data value at that position

ex. Given this data set, find the 68th percentile:

12, 15, 2, 35, 34, 39, 40, 22, 25

Finding the 68th percentile means finding the value such that 68% of the data values are less than or equal to it

(1) Order the data from Least to greatest

2, 12, 15, 22, 25, 34, 35, 39, 40

(2) Assign values to the following variables: k = the percentile, n = the total number of data values in the data set

k = 68, n = 9 (there is a total of 9 data values)

(3) calculate i, the index (or the position) of a data value (substitute k and n)

$i=\frac{k}{100}\left(n+1\right)$

$i=\frac{68}{100}\left(9+1\right)$

= 6.8 ← the index/position of the data value

(4) Use i to determine the data value at that position (find the 6.8 position in the data set, i will either be a whole number or decimal)

Since in this example i is not an integer, round i up and down to the nearest integers. Find the data values at these positions and average them

6.8 gets rounded up and also down so we end up with 6 and 7. Now we find the data values at the 6th and 7th position.

34 and 35 are the 6th and 7th data values. so you would average them.

the answer is 34.5 ← the 68th percentile of this data

Otherwise, if i were an integer (whole number), you would just count that many positions to find it. ex. if i were 4, you would take the 4th data value
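
The percentile steps can be sketched in Python as follows. This follows the i = k/100 (n + 1) rule from these notes. It assumes k is a whole number and that the position lands inside the data set. The function name is made up.

```python
def kth_percentile(values, k):
    ordered = sorted(values)                      # step 1
    n = len(ordered)                              # step 2
    # step 3: i = k/100 * (n + 1), split into whole part and remainder
    # so that no floating-point rounding is involved
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("position falls outside the data set")
    # step 4: whole-number position -> take that value,
    # otherwise average the values at the positions just below and above i
    if remainder == 0:
        return ordered[whole - 1]
    return (ordered[whole - 1] + ordered[whole]) / 2

data = [12, 15, 2, 35, 34, 39, 40, 22, 25]
print(kth_percentile(data, 68))   # 34.5
print(kth_percentile(data, 50))   # 25
```

Positions in the notes start at 1, while Python lists start at 0, which is why the code subtracts 1 when indexing.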

What is a Quartile

Quartiles are percentiles
Percentiles divide ordered data into hundredths
Quartiles are the 25th percentile (Q1)
50th percentile (Q2) or median
75th percentile (Q3)

How to find Quartiles

a. Use the same calculation as for percentiles, or use the median method, to find the first and third quartiles

b. With the percentile calculation, replace k with 25, 50, or 75

How to find Quartiles (with the median method)

Order the data from least to greatest
Find the median
use the lower half of the data to find the Q1 (average the two numbers if there is no middle number)
Use the upper half of the data to find the Q3 (same thing)

Interquartile Range (Measuring Spread of Data)

The interquartile range (IQR) is Q3 - Q1

How to find the IQR

Find the Quartiles
Q3 value number - Q1 value number

Identify outliers in a set of data

An outlier is an extremely high or extremely low value in our data. We can identify an outlier if it is greater than Q3 +1.5 (IQR) or lower than Q1 -1.5(IQR)

How to find outliers

Order the data values from least to greatest
Find the Quartiles
Find the interquartile range
Calculate the upper fence Q3 + 1.5(IQR) and the lower fence Q1 - 1.5(IQR)
Larger than the upper fence? Outlier
Smaller than the lower fence? Outlier
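
A Python sketch of the outlier check, assuming quartiles are found with the median method (the median is left out of both halves when the count is odd). The data set and function names are hypothetical.

```python
def median(ordered):
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def outlier_fences(values):
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2 :])
    iqr = q3 - q1                              # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr      # lower fence, upper fence

data = [2, 12, 15, 22, 25, 34, 35, 39, 90]     # hypothetical data set
lower, upper = outlier_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)   # -21.75 72.25
print(outliers)       # [90]
```

Here Q1 = 13.5, Q3 = 37, and IQR = 23.5, so only 90 falls beyond a fence.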

Box-And-Whisker Plot

Summarizes a set of numerical data based on five key values (the five number summary)
- Minimum: the data point with the least value
- Q1: the least value greater than 25% of the data points
- Median: the middle data value
- Q3: the least value greater than 75% of the data points
- maximum: the data point with the greatest value

Use this five # summary to construct a box-and-whisker plot

5, 9, 17, 22, 34

In a five # summary, numbers are arranged in the order of minimum, Q1, median, Q3, and maximum

DESMOS KEY

Number of data values: L [ ]

Sum of the data values: Total []

Mean of the data

Median of the data

minimum of the data

maximum of the data

Q1 of the data: quartile (L, 1)

Q3 of the data: quartile (L,3)

Sample standard deviation of the data

population standard deviation of the data

2.6.3 - skewness and standard deviation

skew data

data which are mostly clumped in one area but have a few values which are much larger or much smaller
Skew to the right: data has a long tail to the right
Skew to the left: data has a long tail to the left
Symmetrical data

Standard deviation

Standard deviation is a measure of variation based on measuring how far each data value deviates, or is different, from the mean. A few important characteristics:
- Standard deviation is never negative. It is zero if all data values are equal, and gets larger as the data spreads out.
- Standard deviation has the same units as the original data
- Standard deviation, like the mean, can be highly influenced by outliers

How to find standard deviation:

Find the mean (average) of the sample data.
Subtract the mean from each data value to find how far each value is from the average.
Square each of those differences so they are all positive.
Add up all the squared differences.
Divide that total by one less than the number of data values (that means divide by n − 1).
Take the square root of the result

Round to nearest two decimals
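
The steps above, written out as a Python sketch. The data set is hypothetical and the function name is made up.

```python
import math

def sample_std(values):
    n = len(values)
    mean = sum(values) / n                               # find the mean
    squared_diffs = [(x - mean) ** 2 for x in values]    # subtract the mean, then square
    variance = sum(squared_diffs) / (n - 1)              # add up, divide by n - 1
    return math.sqrt(variance)                           # take the square root

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical sample
print(round(sample_std(data), 2))     # 2.14
```

For this data the mean is 5 and the squared differences add up to 32, so the variance is 32 ÷ 7.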

Standard Deviation (Sample)

$s^2=\frac{\sum\left(x-\bar{x}\right)^2}{n-1}$

$s=\sqrt{s^2}$

...where

$s^2$ = variance
$s$ = standard deviation
$x$ = specific data value
$\bar{x}$ = sample mean
$n$ = sample size

Standard Deviation (Population)

$\sigma^2=\frac{\sum\left(x-\mu\right)^2}{N}$

$\sigma=\sqrt{\sigma^2}$

...where

$\sigma^2$ = variance
$\sigma$ = standard deviation
$x$ = specific data value
$\mu$ = population mean
$N$ = size of the population

If the data values represent data collected from a subset of the population, then the sample standard deviation should be used.
If the data values represent data collected from the entire population of interest, then the population standard deviation should be used.
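
To see the difference between the two, Python's standard `statistics` module has `stdev` (sample, divides by n - 1) and `pstdev` (population, divides by N). The data set is hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data set

sample_sd = statistics.stdev(data)        # treats data as a sample
population_sd = statistics.pstdev(data)   # treats data as the whole population

print(round(sample_sd, 2), round(population_sd, 2))   # 2.14 2.0
```

The sample value is always a little larger for the same data because it divides by a smaller number.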

Z-scores

Used to compare scores from different distributions
Values are interpreted in terms of the number of standard deviations above or below the mean
- Positive z-score: The value is above the mean
- Negative z-score: Value is below the mean
The formula for calculating z-scores is identical when working with a sample or the population

$Z=\frac{DataValue-Mean}{StandardDeviation}$
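
A small Python sketch of the formula, comparing the same score on two tests. The means and standard deviations are made up for illustration.

```python
def z_score(value, mean, std_dev):
    # number of standard deviations the value sits above (+) or below (-) the mean
    return (value - mean) / std_dev

# hypothetical: a score of 85 on two different tests
print(z_score(85, 75, 5))    # 2.0  (test A: mean 75, standard deviation 5)
print(z_score(85, 80, 10))   # 0.5  (test B: mean 80, standard deviation 10)
print(z_score(70, 80, 10))   # -1.0 (below the mean)
```

The same raw score is more impressive on test A, where it is 2 standard deviations above the mean.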

sample variance = var in Desmos

sample standard deviation = stdev

population standard deviation = stdevp

Population variance = varp

Entire group = Population
Part of the group = Sample

When comparing two negative z-scores, the one closer to zero is “higher” (less below the mean)
The one farther from zero is “lower” (more below the mean)

FOR NOTECARD

Need all the symbols for population and sample
How to find range, IQR, Outliers, and Z scores, Kth percentile
How to find Lower bound/fence and upper bound/fence
“What is the frequency of” - count how many times it occurs
Relative frequency = frequency ÷ total number of observations (gives a proportion or percentage).
Grouped frequency and estimating group frequency
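
For the notecard items on frequency and relative frequency, a short Python sketch. The list of car colors is hypothetical.

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]   # hypothetical responses

frequency = Counter(colors)                  # count how many times each value occurs
total = sum(frequency.values())              # total number of observations
relative = {c: n / total for c, n in frequency.items()}

print(frequency["red"])    # 3
print(relative["red"])     # 0.5
```

The relative frequencies always add up to 1 (or 100%).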