Statistics Test One Notes

Basic Statistical Terms

  • Population

    • Entire group of people or things being studied

  • Sample

    • A subset (smaller portion) of the population

  • Variable

    • A characteristic or factor that can change or vary

  • Data

    • The values of variables

    • Can come from observations, counts, measurements, or responses

  • Census

    • Data collected from the entire population

Parameter vs. Statistic

  • Parameter

    • A numerical value that describes a population

    • Example: Average age of all people in the U.S.

  • Statistic

    • A numerical value that describes a sample

    • Example: Average age of people from a sample of three states

  • Key Idea

    • Population → parameter

    • Sample → statistic

Types of Studies

Observational Study

  • No attempt is made to control individuals or variables

  • Researchers only observe and record

  • Used when control is difficult, unethical, or impossible

  • Examples:

    • Smoking and lung cancer

    • Premature birth and reading skills

    • Public opinion surveys

Experimental Study

  • Researchers control one or more variables

  • Subjects are randomly assigned to groups

  • A treatment is applied

  • Examples:

    • Music type and cognitive test performance

    • Pesticide use and crop yields

Descriptive vs. Inferential Statistics

  • Descriptive Statistics

    • Organize, summarize, and display data

    • Describe what the data shows

  • Inferential Statistics

    • Use sample data to make conclusions about a population

    • Example conclusion: Married men tend to live longer than unmarried men

Qualitative vs. Quantitative Data

  • Qualitative (Categorical) Data

    • Descriptive, non-numerical

    • Examples:

      • Color of a car

      • Type of music

  • Quantitative Data

    • Numerical measurements or counts

    • Can be:

      • Discrete (countable): number of siblings

      • Continuous (measured): GPA, gallons of water

Methods of Data Collection

  • Survey

    • Collects data by interview, phone, mail, or internet

    • Example: Approval rating of the U.S. president

  • Observational Study

    • Observe and measure characteristics

    • Example: Children’s behavior study

  • Experiment

    • Apply a treatment and observe responses

    • Example: Cinnamon extract and heart disease risk

  • Simulation

    • Uses models (often computers)

    • Example: Crash tests using dummies

Experimental Design & Control

  • Control

    • Reduce effects of variables not being studied

  • Confounding Variables

    • When effects of different factors cannot be separated

  • Placebo Effect

    • Subject responds even though no real treatment was given

  • Blinding

    • Subject does not know if they received treatment or placebo

  • Double-Blind Experiment

    • Neither subject nor researcher knows who received treatment

Sampling Techniques

Random Sampling

  • Every population member has an equal chance of selection

  • Simple Random Sample

    • Every possible sample of the same size has an equal chance

Other Sampling Methods

  • Stratified Sample

    • Divide population into groups (strata) and sample each

  • Cluster Sample

    • Divide into clusters and sample entire clusters

  • Systematic Sample

    • Select every kth member after a random start

  • Convenience Sample

    • Easy to collect but often biased (not recommended)
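The sampling methods above can be sketched in a few lines of Python. This is only an illustration: the function names, the numbered population of 20 members, and the way it is split into strata and clusters are all made up for the example.

```python
import random

def simple_random_sample(population, n, rng):
    # Every possible group of n members is equally likely to be chosen.
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    # Pick a random start among the first k members, then take every kth member.
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # Sample some members from EVERY group (stratum).
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Pick a few whole clusters at random and keep ALL of their members.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical population: members numbered 1 to 20.
rng = random.Random(0)
population = list(range(1, 21))
strata = {"first_half": population[:10], "second_half": population[10:]}
clusters = {"A": population[0:5], "B": population[5:10],
            "C": population[10:15], "D": population[15:20]}

print(simple_random_sample(population, 5, rng))
print(systematic_sample(population, 5, rng))
print(stratified_sample(strata, 2, rng))
print(cluster_sample(clusters, 2, rng))
```

Note the difference between the last two: a stratified sample takes a few members from every group, while a cluster sample takes every member from a few groups.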

Experimental Designs

  • Completely Randomized Design

    • Subjects randomly assigned to treatment groups

  • Randomized Block Design

    • Subjects grouped by similar traits, then randomized

  • Matched-Pairs Design

    • Similar subjects paired; each receives different treatment

Sample Size & Replication

  • Sample Size

    • Number of subjects in a study

    • Larger samples → more reliable results

  • Replication

    • Repeating an experiment with many subjects

  • Misleading Data

    • Data can be misleading when the sample is biased

    • Bias reduces the accuracy of results

  • Bias in Studies

    • Biases disrupt the validity of conclusions

    • Results from biased samples are not accurate or reliable

  • Evaluating a Study

    • Ask: Is the study impartial?

    • Determine whether the sample fairly represents the population

  • Example of Bias

    • Surveying only college students to estimate opinions of all adults

    • This creates sampling bias because not all age groups are represented

  • Key Idea

    • Biased samples → inaccurate and invalid results

  • Outlier

    • A value that is much higher or much lower than the rest of the data

    • Does not follow the overall pattern of the data set

  • Why Outliers Matter

    • Can skew results, especially the mean (average)

    • May make data misleading

    • Can affect conclusions

  • Causes of Outliers

    • Measurement or recording errors

    • Unusual but valid values

    • Data entry mistakes

  • Key Idea

    • Outliers should be investigated, not automatically removed

  • Qualitative Data

    • Data that describes or categorizes attributes of a population

    • Usually expressed using words or letters

    • Also called Categorical data

    • Phrases like which type or what kind indicate qualitative data.

  • Quantitative Data

    • Data that results from counting or measuring

    • Always expressed using numbers

    • Represents numerical values of attributes

    • Phrases like how many or the number of indicate that data is quantitative.

  • Discrete Data (type of quantitative data)

    • Countable numbers

    • No fractions or decimals

  • Continuous Data (type of quantitative data)

    • Data that can take any value in a range

    • can include fractions and decimals

    • Continuous data is defined as the type of quantitative data that is the result of measuring

The purpose of an experiment is to investigate the relationship between variables

  • Explanatory Variable

    • Variable that explains or influences changes in another variable

    • Represents the cause

    • Also called:

      • Independent variable

      • Input variable

      • Predictor variable

  • Response Variable

    • Variable that is affected by changes in the explanatory variable

    • Represents the effect

    • Also called:

      • Dependent variable

      • Outcome variable

      • Output variable

NOTE: An explanatory variable is defined as the independent variable in an experiment. The value or component of the independent variable applied in an experiment is called the treatment


2.3 Stem and Leaf Plot

  • A Statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population

  • A good choice when the data sets are small

  • Each data value will be separated into a “stem” and a “leaf” using its digits

  • The “leaf” consists of a final significant digit. The stem contains the remaining digits in front of the “leaf”

  • Stem-and-leaf plot: a graph, especially good for small sets of data, that separates data points into a leaf consisting of the last significant digit and a stem that consists of any numbers to the left of that digit and can be arranged in ascending or descending order
    Stem-and-leaf plot is also commonly referred to as a Stemplot
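As a rough sketch of the stem/leaf split described above, the Python below builds a stemplot for a small made-up data set. It assumes non-negative whole numbers, so the leaf is the ones digit and the stem is everything in front of it. The function name and the data are illustrative only.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem to its sorted list of leaves (non-negative integers only)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # leaf = last digit, stem = remaining digits
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical data set.
data = [23, 12, 38, 15, 21, 30, 23, 34]

# Stems with no data values are simply not listed in this sketch.
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```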

Create and interpret Dot Plots

  • Dot plots are graphs used to display the distribution of values in a data set.

How to create Dot plots

  1. Find the minimum and maximum of the data set

  2. Create a horizontal line labeled with the values between the minimum and maximum

  3. Draw a dot above the appropriate value for each number in the data set. Stack the dots vertically as needed
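The three steps above can be tried out with a small text-based sketch in Python. It assumes the data are whole numbers and uses `*` for each dot; the function name and sample data are made up for illustration.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot for a list of whole numbers."""
    counts = Counter(values)
    # Step 1: find the minimum and maximum.
    low, high = min(values), max(values)
    # Step 2: the horizontal axis covers every value from minimum to maximum.
    axis = range(low, high + 1)
    width = max(len(str(v)) for v in axis) + 1
    rows = []
    # Step 3: stack one dot per occurrence, drawing the tallest level first.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[v] >= level else " ").rjust(width) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)

# Hypothetical data set.
print(dot_plot([1, 2, 2, 4]))
```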

Characteristics of Dot Plots:

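  1. Uniform - dots are spread evenly across all values

  2. Unimodal - only one mode (the value with the most dots)

  3. Multimodal - more than one mode

  4. Bimodal - two modes

  5. Symmetric - the left side looks the same as the right side

  6. Skewed left - most dots are stacked on the right, with a long tail trailing off to the left

  7. Skewed right - most dots are stacked on the left, with a long tail trailing off to the right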


Bar Graphs

  • Bar graph: a graph to summarize and organize categorical data consisting of rectangular bars that are separated from each other and the length (or height) of the bar for each category is proportional to the number or percent of individuals in each category
    Bar graph may also be referred to as a Bar chart

  • Line graphs are most appropriate for showing how a quantity changes over time. This is known as time series data. Using a line graph in other situations can be misleading.

  • Line graph: a bar graph with the tops of the bars represented by points joined by lines (with rest of the bar not shown) and are only appropriate for ordered (rather than qualitative) variables that show how a quantity changes over time
    Line graph may also be referred to as a Line chart or as a Time series graph



2.6.1 Using measures of Central Tendency Videos

Determine the mean of the test scores

  • The mean is the average. To determine the mean, add the numbers and then divide by the number of data items

  • The downside of the mean is that it is sensitive to outliers, which are data values that fall outside the primary grouping of the data.

Finding the Mean from Frequency tables

Given the frequency table below, which equation shows the mean of the set of data? 

  Data    Frequency
  1       15
  3       5
  7       10
  10      2

To find the mean from a frequency table, multiply each data value by its frequency. Then add the individual products. 1(15)+3(5)+7(10)+10(2) = 120

Take this sum and divide it by the number of data values, which can be found by adding the numbers in the frequency column. 15+5+10+2 = 32

120 divided by 32 is 3.75. This is the mean of the data from the frequency table. 
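The same calculation can be checked with a short Python sketch. The dictionary below holds the frequency table from the question (data value: frequency); the function name is just for illustration.

```python
def mean_from_frequency_table(table):
    """Mean of data given as {data value: frequency}."""
    # Multiply each data value by its frequency and add the products.
    total = sum(value * frequency for value, frequency in table.items())
    # The number of data values is the sum of the frequency column.
    count = sum(table.values())
    return total / count

table = {1: 15, 3: 5, 7: 10, 10: 2}
print(mean_from_frequency_table(table))  # 120 / 32 = 3.75
```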


Estimating the mean from a Grouped frequency table

Grouped Frequency table

  • Supplies data values in intervals (or groups)

  • Regular frequency table gives single data values and gives us more information

  • Having group data intervals means we are able to estimate the mean, but probably not find an exact value

To find the mean, we have to use Midpoints

  • Estimated mean = Sum of (midpoint × frequency) divided by the number of data values

  1. Find the midpoints of all intervals

  2. Multiply the midpoints by their frequencies

  3. Add the products together

  4. Divide by the number of data values

Question

Given the frequency table below, what is the estimated mean? 

  Data Intervals    Frequency
  1-4               3
  5-8               5
  9-12              2
  13-17             1

To find the sum, multiply each midpoint by its corresponding frequency and then add those products together. The sum of the midpoints multiplied by their frequencies is:

  2.5(3) + 6.5(5) + 10.5(2) + 15(1) = 76

To find the number of data values, add the frequencies of the data values: 3+5+2+1=11

In order to find the mean, divide 76 by 11 to get 6.91, which is the estimated mean of the data. 
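Here is a minimal Python sketch of the same estimate, using the intervals and frequencies from the question. The function name and the way the table is stored (a list of interval/frequency pairs) are choices made for this example.

```python
def estimated_mean(grouped_table):
    """Estimate the mean from [((low, high), frequency), ...]."""
    # Use the midpoint of each interval in place of the unknown data values.
    weighted_sum = sum(((low + high) / 2) * frequency
                       for (low, high), frequency in grouped_table)
    count = sum(frequency for _, frequency in grouped_table)
    return weighted_sum / count

grouped_table = [((1, 4), 3), ((5, 8), 5), ((9, 12), 2), ((13, 17), 1)]
print(round(estimated_mean(grouped_table), 2))  # 76 / 11, rounded to 6.91
```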


Find the Median of a data set

The median is the number in the middle when the data is ordered from least to greatest. If there are two middle values, find the mean of the two numbers

  1. Order the values from least to greatest

  2. If there are two middle values, find the mean of both of them

Find the Mode of a Data set

The mode is the number or the numbers that occur the most

  • It is possible to have more than one mode if they occur the same amount of times

  • Sorting the values from least to greatest makes the mode easier to find
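The median and mode rules above can be sketched in Python as follows. The function names and the small data sets are made up for illustration.

```python
from collections import Counter

def median(values):
    ordered = sorted(values)            # order from least to greatest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]          # one middle value
    # Two middle values: take their mean.
    return (ordered[middle - 1] + ordered[middle]) / 2

def modes(values):
    counts = Counter(values)
    highest = max(counts.values())
    # Every value tied for the highest count is a mode.
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical data sets.
print(median([3, 1, 2]))          # 2
print(median([4, 1, 3, 2]))       # 2.5
print(modes([1, 2, 2, 3, 3]))     # [2, 3]
```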

What to Report When There is an Outlier

It's best to report the median

  • Mean: Uses all data, but sensitive to outliers

  • Mode: Easily affected by small changes in frequency

  • Median: Does not use all data, but is robust

2.6.2 Quartiles and Box Plots

  • Summary - beginning, middle and end of a set of data

  • Five number summary components

    • Sample minimum

    • First Quartile (Q1)

    • Second Quartile (median)

    • Third Quartile (Q3)

    • Sample Maximum (largest value)

  • Finding the Five Number summary

    • 1. Sort the # from least to greatest

    • 2. Identify the minimum and the maximum

    • 3. Find the median

    • 4. Find the median of the lower half of the data (Q1)

    • 5. Find the median of the upper half of the data (Q3)

Ex. Given the following list of test scores, find the five number summary:

86, 92, 85, 82, 83, 81, 80, 89, 77, 81, 82, 86, 78, 75, 93

(1) Sort them

75, 77, 78, 80, 81, 81, 82, 82, 83, 85, 86, 86, 89, 92, 93

(2) Find minimum and Maximum

75 and 93

(3) Find the median

82 because it's the middle number

(4) Find Q1 (the median of the lower half of the data, between the minimum and the median)

80

(5) Find Q3 (the median of the upper half of the data, between the median and the maximum)

86
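As a check on the worked example, here is a small Python sketch of the median method, run on the sorted scores from step (1). It assumes the middle value is left out of both halves when the number of data values is odd, which matches the answers above. The function names are illustrative.

```python
def median_of_sorted(ordered):
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]              # values below the median position
    upper_half = ordered[half + n % 2:]      # values above the median position
    return (ordered[0],
            median_of_sorted(lower_half),    # Q1
            median_of_sorted(ordered),       # median
            median_of_sorted(upper_half),    # Q3
            ordered[-1])

scores = [75, 77, 78, 80, 81, 81, 82, 82, 83, 85, 86, 86, 89, 92, 93]
print(five_number_summary(scores))  # (75, 80, 82, 86, 93)
```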


How to find the kth Percentile

  • Percentiles divide ordered data into hundredths

  • Common measure of location of data values within a data set

  • Mostly used with very large populations

  • K represents any number for the percentile

    • ex. What data value is the 15th percentile - the K value would be 15

  • How to find the kth percentile

    • 1. Order the data from least to greatest

    • 2. Assign values to the following variables: k = the percentile, n = the total number of data values in the data set

    • 3. Calculate i, the index (or the position) of a data value, using i = (k/100)(n + 1)

    • 4. Use i to determine the data value at that position

ex. Given this data set, find the 68th percentile:

12, 15, 2, 35, 34, 39, 40, 22, 25

Finding the 68th percentile means that 68% of the data values are less than or equal to this value

(1) Order the data from Least to greatest

2, 12, 15, 22, 25, 34, 35, 39, 40

(2) Assign values to the following variables: k = the percentile, n = the total number of data values in the data set

k = 68, n = 9 (there are 9 data values in total)

(3) calculate i, the index (or the position) of a data value (substitute k and n)

i=\frac{k}{100}\left(n+1\right)

i=\frac{68}{100}\left(9+1\right)

= 6.8 ← the index/position of the data value

(4) Use i to determine the data value at that position (find position 6.8 in the data set; i will be either a whole number or a decimal)

Since i is not an integer in this example, round i both down and up to the nearest integers. Find the data values at these positions and average them.

6.8 rounds down to 6 and up to 7, so we find the data values at the 6th and 7th positions.

34 and 35 are the 6th and 7th data values, so average them.

The answer is 34.5 ← the 68th percentile of this data

Otherwise, if i were an integer (whole number), you would just count to that position. Ex. if i were 4, you would take the 4th data value.
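The four steps can be sketched in Python like this. The sketch follows the method in these notes (index i = (k/100)(n + 1), averaging the two neighbors when i is not a whole number); other textbooks and software use slightly different percentile rules. It assumes k is a whole number, and the function name is made up.

```python
def kth_percentile(values, k):
    ordered = sorted(values)                 # step 1: order the data
    n = len(ordered)                         # step 2: k is given, n is the count
    # Step 3: i = k/100 * (n + 1), kept as whole part and remainder
    # so that no rounding error creeps in.
    whole, remainder = divmod(k * (n + 1), 100)
    if remainder == 0:
        positions = [whole]                  # i is a whole number
    else:
        positions = [whole, whole + 1]       # round i down and up
    if positions[0] < 1 or positions[-1] > n:
        raise ValueError("index falls outside the data set")
    # Step 4: look up the value(s) at those positions (counting from 1).
    picked = [ordered[p - 1] for p in positions]
    return sum(picked) / len(picked)

data = [12, 15, 2, 35, 34, 39, 40, 22, 25]
print(kth_percentile(data, 68))  # i = 6.8, so average the 6th and 7th values: 34.5
```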


What is a Quartile

  • Quartiles are percentiles

  • Percentiles divide ordered data into hundredths

  • Q1 is the 25th percentile

  • Q2 is the 50th percentile, or median

  • Q3 is the 75th percentile

How to find Quartiles

a. You can use either the percentile calculation or the median method to find the first and third quartiles

b. To use the percentile calculation, replace k with 25, 50, or 75

How to find Quartiles (with the median method)

  1. Order the data from least to greatest

  2. Find the median

  3. use the lower half of the data to find the Q1 (average the two numbers if there is no middle number)

  4. Use the upper half of the data to find the Q3 (same thing)


Interquartile Range (Measuring Spread of Data)

  • The interquartile range (IQR) is Q3 - Q1

How to find the IQR

  1. Find the Quartiles

  2. Subtract Q1 from Q3: IQR = Q3 - Q1


Identify outliers in a set of data

  • An outlier is an extremely high or extremely low value in our data. We can identify an outlier if it is greater than Q3 +1.5 (IQR) or lower than Q1 -1.5(IQR)

How to find outliers

  1. Order the data values from least to greatest

  2. Find the Quartiles

  3. Find the interquartile range

  4. Calculate the upper fence, Q3 + 1.5(IQR), and the lower fence, Q1 - 1.5(IQR)

  5. Any value larger than the upper fence is an outlier

  6. Any value smaller than the lower fence is an outlier
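Below is a rough Python sketch of the outlier check, using the rule stated above (a value is an outlier if it is below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)). Quartiles are found with the median method. The function names are illustrative, and the extra score of 120 is a made-up value added to show an outlier being caught.

```python
def median_of_sorted(ordered):
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median_of_sorted(ordered[:half])           # median of the lower half
    q3 = median_of_sorted(ordered[half + n % 2:])   # median of the upper half
    return q1, q3

def fences(values):
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr           # (lower fence, upper fence)

def find_outliers(values):
    lower, upper = fences(values)
    return [v for v in values if v < lower or v > upper]

scores = [75, 77, 78, 80, 81, 81, 82, 82, 83, 85, 86, 86, 89, 92, 93]
print(fences(scores))                  # (71.0, 95.0)
print(find_outliers(scores))           # [] since every score is inside the fences
print(find_outliers(scores + [120]))   # [120], the hypothetical extra score
```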


Box-And-Whisker Plot

  • Summarizes a set of numerical data based on five key values (the five number summary)

    • Minimum: the data point with the least value

    • Q1: the least value greater than 25% of the data points

    • Median: the middle data value

    • Q3: the least value greater than 75% of the data points

    • maximum: the data point with the greatest value

Use this five # summary to construct a box-and-whisker plot

5, 9, 17, 22, 34

In a five # summary, numbers are arranged in the order of minimum, Q1, median, Q3, and maximum


DESMOS KEY

Number of data values: L [ ]

Sum of the data values: Total []

Mean of the data

Median of the data

minimum of the data

maximum of the data

Q1 of the data: quartile (L, 1)

Q3 of the data: quartile (L,3)

Sample standard deviation of the data

population standard deviation of the data


2.6.3 - skewness and standard deviation

skew data

  • data which are mostly clumped in one area but have a few values which are much larger or much smaller

  • Skew to the right: data has a long tail to the right

  • Skew to the left: data has a long tail to the left

  • Symmetrical data: data has no long tail on either side; the left and right sides mirror each other


Standard deviation

  • Standard deviation is a measure of variation based on measuring how far each data value deviates, or is different, from the mean. A few important characteristics:

    • Standard deviation is never negative. It is zero if all data values are equal, and gets larger as the data spreads out.

    • Standard deviation has the same units as the original data

    • Standard deviation, like the mean, can be highly influenced by outliers

How to find standard deviation:

  • Find the mean (average) of the sample data.

  • Subtract the mean from each data value to find how far each value is from the average.

  • Square each of those differences so they are all positive.

  • Add up all the squared differences.

  • Divide that total by one less than the number of data values (that means divide by n − 1).

  • Take the square root of the result

Round to nearest two decimals
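The steps above can be followed line by line in Python. This is a sketch on a made-up data set; the function names are illustrative. The population version is included for comparison and differs only in dividing by the number of data values instead of one less.

```python
import math

def sample_std(values):
    n = len(values)
    mean = sum(values) / n                          # find the mean
    squared = [(x - mean) ** 2 for x in values]     # subtract the mean, then square
    variance = sum(squared) / (n - 1)               # add up, divide by n - 1
    return math.sqrt(variance)                      # take the square root

def population_std(values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n   # divide by N instead
    return math.sqrt(variance)

# Hypothetical data set: mean is 5, squared differences add up to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_std(data), 2))      # sqrt(32 / 7), rounded to 2.14
print(round(population_std(data), 2))  # sqrt(32 / 8) = 2.0
```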

 

Standard Deviation (Sample)

s^2=\frac{\sum\left(x-\bar{x}\right)^2}{n-1}

s=\sqrt{s^2}

...where

  • s^2 = sample variance

  • s = sample standard deviation

  • x = specific data value

  • \bar{x} = sample mean

  • n = sample size

Standard Deviation (Population)

\sigma^2=\frac{\sum\left(x-\mu\right)^2}{N}

\sigma=\sqrt{\sigma^2}

...where

  • \sigma^2 = population variance

  • \sigma = population standard deviation

  • x = specific data value

  • \mu = population mean

  • N = size of the population

  • If the data values represent data collected from a subset of the population, then the sample standard deviation should be used.

  • If the data values represent data collected from the entire population of interest, then the population standard deviation should be used.

Z-scores

  • Used to compare scores from different distributions

  • Values are interpreted in terms of the number of standard deviations above or below the mean

    • Positive z-score: The value is above the mean

    • Negative z-score: Value is below the mean

  • The formula for calculating z-scores is identical when working with a sample or the population

Z=\frac{DataValue-Mean}{StandardDeviation}
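A quick Python sketch of the z-score formula, comparing scores from two different tests. The test means, standard deviations, and scores are all made-up numbers for illustration.

```python
def z_score(data_value, mean, standard_deviation):
    # Number of standard deviations the value sits above (+) or below (-) the mean.
    return (data_value - mean) / standard_deviation

# Hypothetical tests: test A has mean 75 and standard deviation 5,
# test B has mean 80 and standard deviation 20.
z_a = z_score(85, 75, 5)    # 2.0: two standard deviations above the mean
z_b = z_score(70, 80, 20)   # -0.5: half a standard deviation below the mean
print(z_a, z_b)
```

Because z-scores are measured in standard deviations, the two results can be compared directly even though the tests have different means and spreads.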

sample variance = var in Desmos

sample standard deviation = stdev

population standard deviation = stdevp

population variance = varp

  • Entire group = Population

  • Part of the group = Sample

  • When two z-scores are both negative, the one closer to zero is “higher” (not as far below the mean)

  • The one farther from zero is “lower” (farther below the mean)

FOR NOTECARD

  • Need all the symbols for population and sample

  • How to find range, IQR, Outliers, and Z scores, Kth percentile

  • How to find lower bound/fence and upper bound/fence

  • “What is the frequency of” - count how many times it occurs

  • Relative frequency = frequency ÷ total number of observations (gives a proportion or percentage).

  • Grouped frequency and estimating group frequency
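For the notecard items on frequency and relative frequency, here is a small Python sketch. The list of car colors is made up, and the function name is illustrative.

```python
from collections import Counter

def relative_frequencies(values):
    counts = Counter(values)      # frequency: how many times each value occurs
    total = len(values)           # total number of observations
    # Relative frequency = frequency / total number of observations.
    return {value: count / total for value, count in counts.items()}

# Hypothetical observations.
colors = ["red", "blue", "red", "green"]
print(Counter(colors)["red"])          # frequency of "red": 2
print(relative_frequencies(colors))    # red 0.5, blue 0.25, green 0.25
```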