What is statistics
The science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them
Key Concept
The process involved in conducting a statistical study consists of “prepare, analyze, and conclude.”
Statistical Thinking
Demands so much more than the ability to execute complicated calculations. It involves critical thinking and the ability to make sense of results
Data
Collections of observations, such as measurements, genders, or survey responses
Population
The complete collection of all measurements or data that are being considered. Typically, a population is the complete collection of data about which we would like to make inferences
Parameter
Any numerical measurement that describes some characteristic of the population
Sample
It is a subcollection or subset of measurements, objects, or individuals from the population
Statistic
A numerical measurement that describes some characteristic of the sample
Prepare
One common but generally bad sampling practice: the voluntary response sample
A voluntary response sample (or self-selected sample) is one in which the respondents themselves decide whether to be included.
It is a poor practice because people choose for themselves whether to reply, so the sample may not represent the population.
Analyze: Graph and Explore
An analysis should begin with appropriate graphs and explorations of the data.
Analyze: Apply stats methods
A good statistical analysis does not require strong computational skills. A good statistical analysis does require using common sense and paying careful attention to sound statistical methods.
Conclude
We need to distinguish between statistical significance and practical significance.
Conclude: Statistical Significance
Achieved in a study if the likelihood of an event occurring by chance is 5% or less.
Conclude: Practical Significance
It is possible that some treatment or finding is effective, but common sense might suggest that the treatment or finding does not make enough of a difference to justify its use or to be practical.
Misleading conclusions
When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics
Sample Data Reported Instead of Measured
When collecting data from people, it is better to take measurements yourself than to ask subjects to report results
Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading
Order of Questions
Sometimes survey questions are
unintentionally loaded by the order of the
items being considered
Nonresponse
A nonresponse occurs when someone
either refuses to respond or is unavailable.
Percentages
Some studies cite misleading percentages.
Note that 100% of some quantity is all of it,
but if there are references made to
percentages that exceed 100%, such
references are often not justified
Types of Data: Quantitative
(numerical) data consists of numbers that are
measurements or counts.
• Grams of sugar in a cookie
• Number of books a student owns
• Age (in years) when a person first drove a car
Types of Data: Categorical
(qualitative/attribute) data consists of labels or names
that do not represent counts or measurements.
• Favourite Colour
• Nationality
• Student Number
Types of Quantitative Data: Discrete
data occurs when there is a finite or “countable” number
of values that the data can have (i.e. the number of possible values is
0, 1, 2, 3, . . .).
• Number of books a student owns
• Age (in years) when a person first drove a car
Types of Quantitative Data: Continuous
data occurs when there are infinitely many possible values,
such that it is not possible to count them.
• Grams of sugar in a cookie
• Amount of milk that a cow produces
Levels of Measurement
NOIR
Levels of Measurement N
Nominal consists of names, labels, categories only. Has no logical
order, like from low to high.
• Favourite Colour
• Country of birth
Levels of Measurement O
Ordinal data can be placed in a logical order, but differences
between values cannot be obtained.
• E.g. letter grades in a course (A, B, C, D, or F)
• T-shirt sizes
Levels of Measurement I
Interval data can be placed in order and the differences between
any two data values is meaningful. However, there is no natural
zero starting point (where none of the quantity is present).
• Years 1863, 1867, 2001, 1953
• Temperature in Celsius
Levels of Measurement R
Ratio data is interval data with the additional property that there
is also a natural zero starting point (where zero indicates none of
the quantity is present). This means that ratios between values are
meaningful.
• Amount of money in a bank account
• Height
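A small sketch of why ratios are not meaningful at the interval level. The two temperatures are hypothetical, and the Kelvin conversion is brought in only as a scale that does have a natural zero; it is not part of the original cards.

```python
def celsius_to_kelvin(c):
    """Kelvin has a natural zero (absolute zero), so it is a ratio-level scale."""
    return c + 273.15

warm_c, cool_c = 20, 10  # hypothetical temperatures in Celsius

# Differences are meaningful on an interval scale: 10 degrees on either scale.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: 20 C is not "twice as hot" as 10 C.
ratio_celsius = warm_c / cool_c                                      # 2.0, but misleading
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)  # about 1.035

print(difference_c, round(difference_k, 2), ratio_celsius, round(ratio_kelvin, 3))
```

The difference stays the same when the zero point moves, but the ratio changes, which is the practical test for interval versus ratio data.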
Missing Data
A data value is missing completely at random if the likelihood of its being missing is
independent of its value or any of the other values in the data set. That is, any data
value is just as likely to be missing as any other data value
Big Data
Data sets so large and so complex that their analysis is beyond the
capabilities of traditional software tools (may require software
simultaneously running in parallel)
Data Science
Involves applications of statistics, computer science, and
software engineering, along with some other relevant fields (such as
sociology or finance)
Collecting Sample Data: Observational Study
data is collected without modifying or
interfering with the study subjects
Collecting Sample Data: Experiment
Involves applying some treatment and
then observing its effects on the subjects
Simple Random Sample
A sample of n subjects is selected in
such a way that every possible sample
of the same size n has the same
chance of being chosen
Systematic Sampling
Begin at some starting
point then select every
k-th object in the
population
Convenience Sampling
Gather data in an easy way
Stratified sampling
Divide
population into at least two
strata (subgroups) such that
objects in each subgroup
have similar characteristics.
Then randomly sample some
objects from within each
stratum
Cluster Sampling
Divide
population into sections or
clusters, then randomly select
some clusters and then use every
person/object in those clusters
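The simple random, systematic, stratified, and cluster methods above can be sketched roughly as follows. Everything here is illustrative: the population of ID numbers, the stratum and cluster names, and the helper function names are all made up, and choosing a random starting point for the systematic sample is an assumption (the card only says "some starting point").

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n has the same chance of being chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Start somewhere in the first k items (random start is an assumption), then take every k-th."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """strata maps a stratum name to its members; sample randomly within each stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters, then keep every member of the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population: ID numbers 1..100
strata = {"first_year": list(range(1, 51)), "upper_year": list(range(51, 101))}
clusters = {f"room_{i}": list(range(10 * i + 1, 10 * i + 11)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)
print(srs, systematic, stratified, clustered, sep="\n")
```

Note the difference between the last two: stratified sampling takes some members from every subgroup, while cluster sampling takes every member from some subgroups.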
Voluntary-response sampling
Individuals self-select to be in a study or survey
Random sampling error
A discrepancy between a sample
result and the true population
result; such an error results from
chance sample fluctuations.
Non-sampling error
Sample data incorrectly collected, recorded,
or analyzed (such as by selecting a biased
sample, using a defective instrument,
copying the data incorrectly, or applying
statistical methods not appropriate for the
circumstances)
Non-random sampling error
Result of using a sampling method that is
not random, such as using a convenience
sample or a voluntary response sample
Frequency Distribution
shows how a data set is partitioned among several
classes (categories) by listing all categories along with
the number (frequency) of data values in each
Lower class limits
Are the smallest numbers that belong to each class
Upper class limits
Are the largest numbers that belong to each class
Class boundaries
are the numbers used to separate
classes, but without the gaps created
by class limits
Class midpoints
are the values in the middle of the
classes and can be found by adding
the lower class limit to the upper class
limit and dividing the sum by 2
Class width
is the difference between two consecutive lower class limits or two consecutive upper limits
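As a rough illustration of the class terms above, the following sketch works through three hypothetical classes (10-19, 20-29, 30-39). The boundary calculation, which splits the gap between one class's upper limit and the next class's lower limit, is the usual convention and is assumed here rather than stated on the cards.

```python
lower_limits = [10, 20, 30]  # hypothetical lower class limits
upper_limits = [19, 29, 39]  # hypothetical upper class limits

# Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

# Class midpoints: (lower limit + upper limit) / 2 for each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class boundaries: split the gap between classes evenly (assumed convention).
gap = lower_limits[1] - upper_limits[0]
boundaries = [lower_limits[0] - gap / 2] + [hi + gap / 2 for hi in upper_limits]

print(class_width)  # 10
print(midpoints)    # [14.5, 24.5, 34.5]
print(boundaries)   # [9.5, 19.5, 29.5, 39.5]
```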
Relative Frequency Distribution
includes the same class limits as a frequency distribution, but the
frequency of a class is replaced with a relative frequency (a proportion or a
percentage): relative frequency = class frequency / sum of all frequencies
Cumulative frequency distribution
The frequency for each class is the sum of the frequencies for that class and all previous classes; the last cumulative frequency should equal the total number of data values
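A minimal sketch of the three kinds of frequency distribution, using a made-up data set and made-up class limits:

```python
data = [12, 15, 21, 22, 22, 27, 31, 35, 38, 38]  # hypothetical data values
classes = [(10, 19), (20, 29), (30, 39)]          # hypothetical (lower, upper) class limits

# Frequency: how many data values fall in each class.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

# Relative frequency: class frequency / sum of all frequencies.
total = sum(frequencies)
relative = [f / total for f in frequencies]

# Cumulative frequency: running total of the frequencies so far.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

print(frequencies)  # [2, 4, 4]
print(relative)     # [0.2, 0.4, 0.4]
print(cumulative)   # [2, 6, 10]
```

The last cumulative frequency (10) matches the number of data values, which is a quick check that no value was dropped or counted twice.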
Important characteristics of Data
Center, variation, distribution, outliers, time
Centre
A representative value that indicates where the middle of the
data set is located
Variation
measure of the amount of spread in the data
Distribution
The nature or shape of the set of data over a range of
values (such as bell-shaped, uniform, or skewed)
Outliers
Sample values that lie very far from the vast majority of
other sample values
Time
Changing characteristics of the data over time
Histogram
We can use a visual tool called a histogram to determine the shape of a distribution of data.
• A histogram provides a visual display (graph) of a frequency distribution
Shapes of distribution: Normal
Bell curve; most of the data in the centre, with tails on either side.
Shapes of distribution: Uniform
Different
possible values occur
with approximately
the same frequency
Shapes of distribution: Skewed Right (positive)
The data is mostly
on the left, with a
longer tail to the
right
Shapes of distribution: Skewed Left (negative)
The data is mostly on
the right, with a
longer tail to the left
Dotplot
Consists of a graph in which each data value is plotted as a point (or dot) along a scale
of values. Dots that are stacked represent multiple observations of the same values.
Stem-and-leaf plot
Used to display quantitative data by separating values into a “stem” and “leaf”.
• Helps in sorting data and provides a simple visualization of the distribution
Pie Chart
A graph depicting qualitative data as slices of a circle, in which
the size of each slice is proportional to frequency count
4 Measures of Centre
Mean, Median, Mode, Midrange
Mean
The mean (or arithmetic mean) of a set of data is the measure
of centre found by adding all data values and dividing the total
by the number of data values
n
represents the number of data
values in a sample.
N
represents the number of data
values in a population.
Resistant
A statistic is resistant if the presence of extreme values
(outliers) does not cause it to change very much.
Mean disadvantage
A disadvantage of the mean is that just one extreme value (outlier) can
change the value of the mean substantially. Using the definition of
"resistant", we say that the mean is not resistant.
Median (resistant)
The median of a data set is the middle value when the
original data values are arranged in order of increasing (or
decreasing) magnitude.
Mode (resistant)
The mode of a data set is the value(s) that occur(s) with the greatest
frequency. It can be found for qualitative data. Bimodal: two modes; multimodal: three or more modes
Midrange
The midrange of a data is the value midway
between the maximum and minimum values in
the original data set. It is found by adding the
maximum data value to the minimum data
value and then dividing the sum by 2
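The four measures of centre can be computed with Python's standard `statistics` module, as in this sketch. The data values are hypothetical; the second half adds a single made-up outlier to show what "resistant" means in practice.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical data values

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value of the sorted data
modes = statistics.multimode(data)      # value(s) with the greatest frequency
midrange = (max(data) + min(data)) / 2  # midway between maximum and minimum

# Add one extreme value and recompute to compare resistance.
with_outlier = data + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)
midrange_with_outlier = (max(with_outlier) + min(with_outlier)) / 2

print(mean, median, modes, midrange)
print(mean_with_outlier, median_with_outlier, midrange_with_outlier)
```

With the outlier added, the mean and midrange jump sharply while the median barely moves, which is why the median is called resistant and the mean is not.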
Round-Off Rules
• For the mean, median, and midrange, carry one more decimal
place than is present in the original set of values.
• For the mode, leave the value as is without rounding (because
values of the mode are the same as some of the original data
values).
Range (not resistant)
The range of a set of data values is the difference
between the maximum data value and the minimum
data value.
Range = (maximum data value) − (minimum data value)
STDV
The standard deviation of a set of sample values, denoted by s, is
a measure of how much data values deviate away from the mean.
Notation
s = sample standard deviation
σ = population standard deviation
Important properties
• The standard deviation is a
measure of how much data values
deviate away from the mean.
• The value of the standard deviation
s is never negative. It is zero only
when all of the data values are
exactly the same.
• Larger values of s indicate greater
amounts of variation.
• The standard deviation s can increase
dramatically with one or more outliers
(not resistant).
• The units of the standard deviation s
(such as minutes, feet, pounds) are the
same as the units of the original data
values.
• The sample standard deviation s is a
biased estimator of the population
standard deviation σ, which means that
values of the sample standard deviation
s do not center around the value of σ
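Population STDV
Same calculation as the sample standard deviation, except that the sum of squared deviations is divided by N (the population size); for a sample we divide by n − 1.
σ = √( Σ(x − μ)² / N ), s = √( Σ(x − x̄)² / (n − 1) )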
Variance of a sample and a population
The variance of a set of values is a measure of
variation equal to the square of the standard
deviation.
• Sample variance:
s² = square of the standard deviation s.
• Population variance:
σ² = square of the population standard
deviation σ
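A short sketch of the range, variance, and standard deviation, computed by hand so that the only difference between the sample and population versions (dividing by n − 1 versus N) is visible. The data values are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data values

data_range = max(data) - min(data)  # maximum minus minimum
mean = statistics.mean(data)
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / len(data)    # divide by N
sample_variance = squared_deviations / (len(data) - 1)  # divide by n - 1

population_sd = math.sqrt(population_variance)  # sigma
sample_sd = math.sqrt(sample_variance)          # s

print(data_range, population_variance, population_sd)
print(round(sample_variance, 3), round(sample_sd, 3))
```

The standard library gives the same results through `statistics.pvariance`, `statistics.pstdev`, `statistics.variance`, and `statistics.stdev`.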
z scores
A z-score is the number of standard deviations that a given value x lies
above or below the mean
Properties of z score
unitless measurement
• A data value is “unusual” if its z-score is less
than -2 or greater than +2
• A negative z-score indicates that the
observation lies below the mean; a positive z-
score indicates that the observation lies above
the mean
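In formula form, a z-score is z = (x − mean) / standard deviation. The sketch below applies that along with the "unusual" cut-offs of −2 and +2 from the card; the mean of 100 and standard deviation of 15 are made-up values for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def is_unusual(z):
    """Unusual if the z-score is less than -2 or greater than +2."""
    return z < -2 or z > 2

mean, sd = 100, 15  # hypothetical mean and standard deviation

z_high = z_score(130, mean, sd)   # exactly 2 standard deviations above the mean
z_low = z_score(62.5, mean, sd)   # 2.5 standard deviations below the mean

print(z_high, is_unusual(z_high))  # 2.0 False
print(z_low, is_unusual(z_low))    # -2.5 True
```

A z-score of exactly 2 is not flagged, because the rule asks for values strictly beyond ±2.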
Percentiles
Percentiles are measures of
location, denoted P1, P2, . . . ,
P99, which divide a set of data
into 100 groups with about 1%
of the values in each group
To find the percentile of a data value x
Percentile of x = (number of values less than x / total number of values) × 100
If you are given a percentile
L = (k/100) × n, where k = the percentile and n = the total number of values
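Both directions of the percentile calculation can be sketched as below. The step that turns the locator L into a data value (average the L-th and next value when L is a whole number, otherwise round L up) is a common textbook convention and is an assumption here, since the cards give only the formula for L. The function names and data are hypothetical.

```python
def percentile_of_value(data, x):
    """(number of values less than x / total number of values) x 100."""
    below = sum(v < x for v in data)
    return 100 * below / len(data)

def value_at_percentile(data, k):
    """Find P_k using the locator L = (k/100) * n (conversion rule is an assumed convention)."""
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(data)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)  # L = whole + remainder/100
    if remainder == 0:
        # L is a whole number: average the L-th and (L+1)-th sorted values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round L up to the next whole position (1-based), i.e. index `whole`.
    return ordered[whole]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical data values

print(percentile_of_value(data, 8))   # 70.0
print(value_at_percentile(data, 25))  # 3
print(value_at_percentile(data, 50))  # 5.5
```

Statistical software often uses other interpolation rules, so results for small data sets can differ slightly from this hand method.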
Quartiles
are measures of location,
denoted Q1, Q2, and Q3, which
divide a set of data into four groups
with about 25% of the values in
each group. In short:
• Q1 = P25
• Q2 = P50 = MEDIAN!
• Q3 = P75
IQR
IQR= Q3-Q1
5-number summary
• Minimum
• First quartile, Q1
• Second quartile, Q2 (same as the median)
• Third quartile, Q3
• Maximum
Outlier
A data value is an outlier if it is above Q3 by an amount
greater than 1.5 × IQR, or below Q1 by an amount
greater than 1.5 × IQR
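The outlier rule can be sketched as two cut-off values ("fences" is a common name for them, not one used on the cards). To avoid committing to a particular method of finding quartiles, Q1 and Q3 are taken as given inputs; the data and quartile values are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the cut-offs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Values below the lower cut-off or above the upper cut-off."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

data = [1, 12, 13, 14, 15, 16, 17, 18, 40]  # hypothetical data values
q1, q3 = 13, 17                              # quartiles assumed to be already known

print(outlier_fences(q1, q3))       # (7.0, 23.0)
print(find_outliers(data, q1, q3))  # [1, 40]
```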
Sample space
Is the collection of all possible simple
events for a procedure, i.e., the set of all possible outcomes
Probability Notation
• P denotes a probability.
• A,B,C,…denote specific events.
• P(A) denotes the “probability of event A occurring.”
Three main ways to determine probability: Classical Approach
Assume that a procedure has n different
simple events, and that each simple event has an equal
chance of occurring
P(A) = number of ways A occurs / number of different simple events
Three main ways to determine probability: Relative Frequency
Conduct/observe a
procedure, count the number of times the event occurred
P(A) = number of times A occurred / number of times the procedure was repeated
Three main ways to determine probability: Subjective probabilities
The probability of an event is
estimated by using knowledge of the relevant circumstances.
Law of large numbers
As a procedure is repeated again and again, the relative frequency
probability of an event tends to approach the actual probability.
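A small simulation can illustrate the classical approach, the relative frequency approach, and the law of large numbers together. The event (rolling a 3 on a fair six-sided die), the seed, and the trial counts are all made up for illustration.

```python
import random
from fractions import Fraction

# Classical approach: 1 way to roll a 3 out of 6 equally likely simple events.
classical = Fraction(1, 6)

def relative_frequency(trials, rng):
    """Relative frequency approach: times the event occurred / times the procedure was repeated."""
    hits = sum(rng.randint(1, 6) == 3 for _ in range(trials))
    return hits / trials

rng = random.Random(42)  # fixed seed so the sketch is repeatable
estimates = {n: relative_frequency(n, rng) for n in (10, 1_000, 100_000)}

for n, estimate in estimates.items():
    print(n, estimate, abs(estimate - float(classical)))
```

With more repetitions the estimate tends to settle near 1/6, although any single run with few trials can land far from it.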
Important Principles for probability
• The probability of an event is a number between 0 and 1, inclusive.
• If it is impossible for an event to occur, then the probability it will occur is 0.
• If an event is certain to occur, then the probability it will occur is 1.
Complement of an Event
Definition: The complement of an event A consists of all outcomes that do not
belong to A and is denoted by Ā (A with a line over it)
2 important rules
1. Rule of Complementary Events: P(A) + P(Ā) = 1
or P(A) = 1 − P(Ā)
or P(Ā) = 1 − P(A)
2. The sum of all probabilities in a sample space must equal