Data
the facts & figures collected, analyzed, and summarized for presentation and interpretation.
Dataset
all the data collected for a particular analysis
Element
the entity on which data is collected
Variable
a characteristic of interest of an element
Observation
the variables associated with an individual element
Categorical
uses labels or names to identify categories; measured on a nominal or ordinal scale
Quantitative
uses numeric values that indicate how much or how many
Cross-sectional
data collected at the same or approximately the same point in time
Time Series
data collected over several time periods
Panel
combination of cross-sectional and time series data
Descriptive
summarizes and describes data or variables, typically in tabular, graphical, or numerical form
Population
the set of all elements of interest in a statistical analysis
Sample
is a subset of the population
Statistical Inference
uses data from a sample to make estimates and test hypotheses about the characteristics of a population
Row 1 contains the __; column A contains the __; the rest of the worksheet contains the __
variable names; elements; data in the dataset
Descriptive Analytics
describes what has happened in the past
Predictive Analytics
uses statistical models built from past data to predict the future [forecasting] or assess the impact of one variable on another [inference]
Prescriptive Analytics
uses models seeking to find a best (optimal) solution. Often these are some type of optimization model
Volume
the number of observations
Velocity
the speed at which data is collected
Variety
the different types and forms of data
Veracity
the reliability of the data generated
Data Mining
focuses on extracting predictive information from big data
Frequency Distribution
a tabular summary of data showing the number (i.e. frequency) of observations in each of several non-overlapping categories
Relative Frequency
frequency of a class divided by n, the total number of observations
Percent Frequency
relative frequency x 100
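The three definitions above can be sketched in a few lines of Python. This is only an illustration: the cards themselves use Excel, and the `purchases` list is made-up data.

```python
from collections import Counter

# Hypothetical sample of 10 soft drink purchases (made-up data).
purchases = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
             "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]

n = len(purchases)

# Frequency: number of observations in each non-overlapping category.
frequency = Counter(purchases)

# Relative frequency: frequency of a class divided by n.
relative_frequency = {label: count / n for label, count in frequency.items()}

# Percent frequency: relative frequency x 100.
percent_frequency = {label: 100 * rel for label, rel in relative_frequency.items()}

for label in frequency:
    print(label, frequency[label], relative_frequency[label], percent_frequency[label])
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any frequency table.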
Bar Chart and Pie Chart
visual displays of frequency, relative frequency, and percent frequency distributions
Histogram
A visual display of a frequency, relative frequency or percent frequency distribution, where the variable of interest is on the horizontal axis and the frequency, relative frequency or percent frequency is on the vertical axis
Cumulative Percent Frequency Distribution
Shows proportion/percentage of data items with values less than or equal to the upper limits of each class
Number of Classes
Between 5 and 20
Small datasets have fewer; larger datasets have more
Width of the Class
Generally, the same for each class
Approx. width = (largest value - smallest value) / # of classes
Class Limits
Each observation must belong to exactly one class
Relative Frequency Distribution
Frequency of the class / n
Crosstabulation
a tabular summary of data for two variables (either categorical or quantitative)
Example: suppose we have data from a sample of 300 restaurants on overall quality and meal price. (A crosstabulation lets us see whether there is a pattern between the two variables.)
Scatter Diagram & Trendline
a scatter diagram is a graphical display of the relationship between two quantitative variables
a trendline provides an approximation (i.e. an estimate) of the relationship, which can be positive, negative, or show no apparent pattern
Side-by-Side & Stacked Bar Charts
These are extensions of a basic bar chart as they are used to display and compare two variables.
A side-by-side bar chart depicts multiple bar charts on the same display
A stacked bar chart has one bar broken into segments of a different color showing the relative frequency of each class
Mode
is the value that occurs with the greatest frequency. If two values are tied for most frequent, the data are bimodal; if more than two, the data are multimodal
Geometric Mean
A measure of location computed as the nth root of the product of n values
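A minimal Python sketch of the nth-root definition, compared against the standard library's `statistics.geometric_mean`. The growth factors are hypothetical values chosen for illustration.

```python
import math
from statistics import geometric_mean

# Hypothetical yearly growth factors (1.10 means +10%, 0.95 means -5%).
growth_factors = [1.10, 0.95, 1.20]

n = len(growth_factors)

# Definition: nth root of the product of the n values.
by_hand = math.prod(growth_factors) ** (1 / n)

# Same quantity from the standard library.
by_library = geometric_mean(growth_factors)

print(by_hand, by_library)
```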
Percentile
provides information about how the data is spread over the interval from the smallest to the largest value
Quartiles
represent how the data is spread over four parts, each containing approximately 25% of the observations
Range
largest value - smallest value
a measure of variability or dispersion of the data
Interquartile Range
Q3 - Q1, is the range of the middle 50% of the data
a measure of variability or dispersion in the data
Variance
measures variability using all the data, since it is based on the difference between the value of xi and the mean
The difference is called deviation about the mean
For a sample, the deviation is xi - x̄
For a population, the deviation is xi - μ
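The deviations about the mean can be turned into a variance as sketched below. The card does not state the divisor, so this uses the standard definitions as an assumption: sum of squared deviations divided by n - 1 for a sample and by n for a population. The data are hypothetical.

```python
from statistics import variance

# Hypothetical sample of five class sizes.
class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n

# Deviation about the mean for each observation: xi - mean.
deviations = [x - mean for x in class_sizes]

# Sum of squared deviations.
sum_squared = sum(d ** 2 for d in deviations)

# Sample variance divides by n - 1; population variance divides by n.
sample_variance = sum_squared / (n - 1)
population_variance = sum_squared / n

print(deviations)
print(sample_variance, population_variance)
print(variance(class_sizes))  # library check of the sample variance
```

The deviations always sum to zero, which is why they are squared before averaging.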
Distribution Shape
is measured by skewness
if the shape of the data is skewed to the left, the skewness is negative (mean < median)
if to the right then skewness is positive (mean > median)
if the data is symmetric, then skewness is zero (mean = median)
Coefficient of Variation
This is a measure of how large the standard deviation is relative to the mean: (standard deviation / mean) x 100
Z-Score
measures the relative location of a value in the dataset; helps determine how far a particular value is from the mean
yields a standardized value: the # of standard deviations the value lies from the mean
a measure of the relative location of the observation in the dataset
uses mean and std. deviation in calc.
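A short Python sketch of the z-score calculation, assuming the usual formula z = (xi - mean) / s with the sample standard deviation s. The data are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical sample data.
data = [46, 54, 42, 46, 32]

x_bar = mean(data)   # sample mean
s = stdev(data)      # sample standard deviation

# z-score: number of standard deviations each value lies from the mean.
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
```

A negative z-score means the value is below the mean; a positive one means it is above.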
Chebyshev's Theorem
allows us to make statements about the proportion of data values that must be within a specified # of standard deviations of the mean: at least (1 - 1/z^2) of the values, for any z > 1
If z = 2, at least 75% of the data must be within 2 std. dev. of the mean
If z = 3, at least 89% of the data must be within 3 std. dev. of the mean
If z = 4, at least 94% of the data must be within 4 std. dev. of the mean
If the data are bell-shaped around the mean, the empirical rule applies:
Approx. 68% of the data is within one s of the sample mean (x̄)
Approx. 95% of the data is within two s of the sample mean (x̄)
Approx. 99.7% of the data is within three s of the sample mean (x̄)
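The percentages quoted for Chebyshev's theorem come from the bound 1 - 1/z^2, which holds for any z greater than 1. A small Python sketch (the function name is made up for illustration):

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

for z in (2, 3, 4):
    print(z, round(100 * chebyshev_lower_bound(z), 1))
```

These are minimum guarantees for any distribution, so they are lower than the bell-shaped figures of roughly 68%, 95%, and 99.7%.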
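Detecting Outliers
outliers are extreme values relative to the rest of the data
z-scores can help identify outliers: any observation with a z-score below -3 or above +3 is treated as an outlier
the interquartile range can also help: values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) are commonly flagged as outliers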
Covariance
is a descriptive measure of the linear association between two variables
Sxy = sample covariance
if Sxy > 0, then there is positive linear association between x and y
if Sxy < 0, then there is negative linear association between x and y
Sample Correlation Coefficient
ranges from -1 to +1
If +1, all data points lie on a positively sloped straight line
If -1, all data points lie on a negatively sloped straight line
As the data points deviate from a straight-line relationship, the correlation coefficient moves closer to 0
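A Python sketch of the sample covariance and sample correlation coefficient. It assumes the standard formulas, which the cards do not spell out: Sxy is the sum of products of deviations divided by n - 1, and r = Sxy / (Sx * Sy). The paired data are hypothetical.

```python
from math import sqrt

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Sample correlation coefficient: always between -1 and +1.
r = s_xy / (s_x * s_y)

print(s_xy, round(r, 4))
```

Here Sxy is positive, so the linear association is positive, and r rescales it to the -1 to +1 range so its strength can be compared across datasets.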
Probability
a numerical measure of the likelihood of an event occurring
a probability ranges from 0 to 1
Experiment
a process generating well-defined outcomes
ex: rolling a 6-sided die results in six possible outcomes: S = {1,2,3,4,5,6}
Combinations
A counting rule allowing one to count the # of experimental outcomes when selecting n objects from a set of N objects, where the order of selection does not matter
Permutations
A counting rule computing the # of experimental outcomes when n objects are to be selected from a set of N objects where the order is important
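Both counting rules are available in Python's standard library as `math.comb` and `math.perm`. The sketch below checks them against the factorial formulas N! / (n! (N - n)!) and N! / (N - n)!, using hypothetical values N = 5 and n = 2.

```python
from math import comb, factorial, perm

N = 5  # objects in the set (hypothetical)
n = 2  # objects selected (hypothetical)

# Combinations: order does not matter.
combinations = comb(N, n)
combinations_by_formula = factorial(N) // (factorial(n) * factorial(N - n))

# Permutations: order matters, so each combination is counted n! times.
permutations = perm(N, n)
permutations_by_formula = factorial(N) // factorial(N - n)

print(combinations, permutations)
```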
Requirements of Assigning Probabilities
The probability assigned to each outcome must be between 0 and 1
The sum of the probabilities for all outcomes must be equal to 1
Classical Method
used when all experimental outcomes are equally likely, e.g. a coin toss or a roll of a 6-sided die
with n equally likely outcomes, each outcome has probability 1/n
Relative Frequency Method
used when data are available to estimate the proportion of time the experimental outcomes will occur if the experiment is repeated a large # of times
Subjective Method
used when outcomes are not equally likely and data is unavailable
Probability of an Event
the probability of an event is equal to the sum of the probabilities of the sample points in the event
P(C) = P(2,6) + P(2,7) + P(3,6)
P(C) = 0.15 + 0.15 + 0.10 = 0.35
P(S) = P(2,6) + P(2,7)
P(S) = 0.15 + 0.15 = 0.30
Union of Two Events
the event containing all sample points belonging to Event A, Event B or both
denoted by A ∪ B (whole bubble diagram)
Intersection of Two Events
the event containing the sample points belonging to both A and B
denoted by A ∩ B (only the middle of the bubble diagram)
Addition Law
useful when we want to know the probability that at least one of two events occurs
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
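The addition law can be checked on a small example. The sketch below assumes one roll of a fair 6-sided die with equally likely outcomes; the two events are hypothetical choices.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
event_a = {2, 4, 6}   # hypothetical event: roll is even
event_b = {4, 5, 6}   # hypothetical event: roll is greater than 3

def probability(event):
    # Classical method: favorable outcomes / total equally likely outcomes.
    return Fraction(len(event), len(sample_space))

# Union computed directly from the sample points.
p_union_direct = probability(event_a | event_b)

# Union computed with the addition law.
p_union_by_law = (probability(event_a) + probability(event_b)
                  - probability(event_a & event_b))

print(p_union_direct, p_union_by_law)
```

Subtracting the intersection avoids counting the shared sample points (4 and 6) twice.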
Mutually Exclusive Events
occur when two events have no sample points in common
Conditional Probability
probabilities are often influenced by whether a related event already occurred.
suppose event A occurs with probability P(A). If event B has already occurred, this new information results in a new probability for A, called the conditional probability: P(A|B)
Joint Probability
the probability of the intersection of two events
Random Variable
a numeric description of the outcome of an experiment; it is either discrete or continuous
Bivariate Probability Distribution
a probability distribution involving two random variables
Marginal Probabilities
the sum of the joint probabilities (by row and column)
Independent Events
Event A and Event B are independent if: P(A|B) = P(A)
or P(B|A) = P(B)
Multiplication Law
used to compute the probability of the intersection of two events
Discrete Random Variable
may assume a finite number of values or an infinite sequence of values such as 0, 1, 2, ...
examples are a toss of a coin, the # of customers who place an order, or the product chosen by a customer from two options
Continuous Random Variables
any numerical value in an interval or collection of intervals
examples are the time a customer spends on a webpage, ounces in a soft drink, the value of a stock in one year
Variance
measures the variability or dispersion of the random variable
Standard Deviation
the positive square root of the variance
Bivariate Probability Distribution
involves two random variables, such as rolling a die two times or recording the percentage change for a stock fund and a bond fund over a year
often the analyst is interested in the relationship between the two random variables; covariance and the correlation coefficient measure the linear association between them
Binomial Probability Distribution
is based on four properties:
the experiment consists of a sequence of n identical trials
two outcomes are possible on each trial: success or failure
the probability of success (p) and the probability of failure (1-p) do not change from trial to trial
the trials are independent
Using Excel to Compute Binomial Probabilities
Enter formula, Binom.Dist
needs a value for x, n, and p
mark either true (cumulative probability) or false (probability)
ex. =Binom.Dist(B5,$D$1,$D$2,FALSE)
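For comparison with the Excel function, here is a Python sketch of the binomial probability function, f(x) = C(n, x) p^x (1-p)^(n-x). The function names and the example values n = 3 and p = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    # Probability of exactly x successes in n trials (FALSE in Excel).
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    # Probability of at most x successes (TRUE in Excel).
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

# Hypothetical example: 3 trials, probability of success 0.3.
print(binomial_pmf(2, 3, 0.3))
print(binomial_cdf(2, 3, 0.3))
```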
Poisson Probability Distribution
used to estimate the # of occurrences over a specified interval of time or space
Using Excel to Compute Poisson Probabilities
Enter formula, Poisson.Dist
needs a value of x, the value of μ (mean), and TRUE (cumulative probability) or FALSE (probability)
ex. =Poisson.Dist(A4,$D$1,FALSE)
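A Python sketch of the Poisson probability function, f(x) = μ^x e^(-μ) / x!, as a cross-check on the Excel function. The function names and the example mean of 10 occurrences per interval are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    # Probability of exactly x occurrences in the interval (FALSE in Excel).
    return mu ** x * exp(-mu) / factorial(x)

def poisson_cdf(x, mu):
    # Probability of at most x occurrences (TRUE in Excel).
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

# Hypothetical example: mean of 10 occurrences per interval.
print(round(poisson_pmf(5, 10), 4))
print(round(poisson_cdf(5, 10), 4))
```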
Hypergeometric Probability Distribution
similar to the binomial distribution, except the trials are not independent and the probability of success changes from trial to trial
r is the # of success in population N, and N-r is the # of failures
Using Excel to Compute Hypergeometric Probability
Enter Hypgeom.Dist
needs a value for x, n (the # of trials), r, and N, and either TRUE or FALSE
=Hypgeom.Dist(1,3,5,12,TRUE)
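A Python sketch of the hypergeometric probability function, f(x) = C(r, x) C(N-r, n-x) / C(N, n). It reads the four numbers in the Excel example above (1, 3, 5, 12) as x, n, r, and N, which is an assumption about that example; the function names are made up.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(x, n, r, N):
    # Probability of exactly x successes in n draws without replacement,
    # from a population of N that contains r successes.
    return Fraction(comb(r, x) * comb(N - r, n - x), comb(N, n))

def hypergeometric_cdf(x, n, r, N):
    # Probability of at most x successes.
    return sum(hypergeometric_pmf(k, n, r, N) for k in range(x + 1))

# Assumed reading of the Excel example: x = 1, n = 3, r = 5, N = 12.
print(float(hypergeometric_pmf(1, 3, 5, 12)))
print(float(hypergeometric_cdf(1, 3, 5, 12)))
```

Exact fractions are used so the results are easy to verify by hand: 105/220 for exactly one success and 140/220 for at most one.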
Continuous Random Variable
probabilities are computed differently than for a discrete random variable
for discrete random variables, we compute the probability at a specific value of x
for continuous random variables, we compute the probability that the random variable assumes any value in an interval
this probability is the area under the probability density function, f(x), over that interval
Difference Between Discrete and Continuous Random Variables
for a discrete random variable, probabilities are computed for the specific values the variable takes on; for a continuous random variable, probabilities are computed for the variable falling within an interval
the probability that a continuous random variable falls within a given interval is defined as the area under the graph of the probability density function over that interval
(a single point is an interval of width 0, so the probability of any single value in the continuous case is 0)
Using Excel to Compute Exponential Probabilities
Enter Expon.Dist
needs x, a value for 1/μ, and TRUE or FALSE
P(x <= 18) with μ = 15: =Expon.Dist(18,1/15,TRUE)
P(6 <= x <= 18): =Expon.Dist(18,1/15,TRUE)-Expon.Dist(6,1/15,TRUE)
P(x >= 8): =1 - Expon.Dist(8,1/15,TRUE)
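The three Excel calculations above can be reproduced with the exponential cumulative formula P(x <= x0) = 1 - e^(-x0/μ). A Python sketch, assuming the mean of 15 implied by the 1/15 argument (the function name is made up):

```python
from math import exp

def exponential_cdf(x, mu):
    # Cumulative probability P(X <= x) for an exponential with mean mu.
    return 1 - exp(-x / mu)

mu = 15  # mean implied by the 1/15 argument in the Excel formulas

p_at_most_18 = exponential_cdf(18, mu)
p_between_6_and_18 = exponential_cdf(18, mu) - exponential_cdf(6, mu)
p_at_least_8 = 1 - exponential_cdf(8, mu)

print(round(p_at_most_18, 4))
print(round(p_between_6_and_18, 4))
print(round(p_at_least_8, 4))
```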
Normal Probability Distribution
most used probability distribution for continuous random variables
it provides a description of likely results obtained through sampling
bell curve
Characteristics of the Normal Distribution
only two parameters: μ and σ
highest point is the mean, which is also the median and the mode
the mean can take on any numerical value
the normal distribution is symmetric; skewness = 0
the std. dev. (σ) determines how flat or wide the curve is (larger σ = wider/flatter curves)
probabilities for a normal random variable are given by the area under the normal curve (total area under the curve = 1)
68.3% of values are within 1σ of μ, 95.4% within 2σ of μ, 99.7% within 3σ of μ
Using Excel to Compute Normal Probabilities
Enter Norm.Dist
find value for x, μ (mean), and standard deviation, and TRUE/FALSE
lower tail: =Norm.Dist(20000,36500,5000,TRUE)
interval: =Norm.Dist(40000,36500,5000,TRUE) - Norm.Dist(20000,36500,5000,TRUE)
upper tail: =1 - Norm.Dist(40000,36500,5000,TRUE)
Using Excel to Compute Normal Probabilities (but for value of x)
x value with 0.10 in lower tail: =Norm.Inv(0.1,36500,5000)
x value with 0.025 in upper tail: =Norm.Inv(0.975,36500,5000)
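The same lower-tail, interval, upper-tail, and inverse calculations can be sketched with Python's `statistics.NormalDist`, using the mean of 36500 and standard deviation of 5000 from the Excel examples. This is a cross-check, not a replacement for the Excel approach on the cards.

```python
from statistics import NormalDist

# Mean and standard deviation taken from the Excel examples.
dist = NormalDist(mu=36500, sigma=5000)

lower_tail = dist.cdf(20000)                    # P(x <= 20000)
interval = dist.cdf(40000) - dist.cdf(20000)    # P(20000 <= x <= 40000)
upper_tail = 1 - dist.cdf(40000)                # P(x >= 40000)

# Inverse lookups: x values for a given cumulative (lower-tail) probability.
x_lower_10 = dist.inv_cdf(0.10)     # 0.10 in the lower tail
x_upper_025 = dist.inv_cdf(0.975)   # 0.025 in the upper tail

print(round(lower_tail, 4), round(interval, 4), round(upper_tail, 4))
print(round(x_lower_10), round(x_upper_025))
```

The inverse function always takes a lower-tail probability, so an upper-tail area of 0.025 is entered as 0.975.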
Standard Normal Probability Distribution
a normal distribution with μ = 0 and σ = 1
Using Excel to Compute Standard Normal Probabilities and Z-Values
Enter Norm.S.Dist
find value of z and TRUE or FALSE (we use TRUE for cumulative probabilities)
P(z <= V); if V = 1: =Norm.S.Dist(1,TRUE)
P(V1 <= z <= V2); if V1 = -0.5 and V2 = 1.25: =Norm.S.Dist(1.25,TRUE) - Norm.S.Dist(-0.5,TRUE)
P(z >= V); if V = 1.58: =1-Norm.S.Dist(1.58,TRUE)
Using Excel to Compute Standard Normal Probabilities and Z-Values (but for z-values)
z-value with 0.025 in lower tail: =Norm.S.Inv(0.025)
z-value with 0.025 in upper tail: =Norm.S.Inv(0.975)
Using Excel to Calculate E(x) or μ, σ^2 & σ
Mean: =sumproduct(A:A,B:B)
Squared Deviation from Mean: =(A2 - F$2)^2
Variance: =sumproduct(C:C,B:B)
Standard Deviation: =sqrt(B11)
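The worksheet steps above (sumproduct for the mean, squared deviations, sumproduct for the variance, square root) can be mirrored in Python. The values and probabilities below are a hypothetical discrete distribution.

```python
from math import sqrt

# Hypothetical discrete distribution: values of x and probabilities f(x).
values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Expected value: sum of x * f(x), like =sumproduct on the two columns.
expected_value = sum(x * f for x, f in zip(values, probabilities))

# Squared deviation of each value from the mean.
squared_deviations = [(x - expected_value) ** 2 for x in values]

# Variance: sum of squared deviation * f(x).
variance = sum(d * f for d, f in zip(squared_deviations, probabilities))

# Standard deviation: positive square root of the variance.
standard_deviation = sqrt(variance)

print(round(expected_value, 2), round(variance, 2), round(standard_deviation, 2))
```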
Using Excel to Calculate the Sample Covariance and Sample Correlation Coefficient
Enter =covariance.s(
and select the cells
Enter =correl(
and select the cells
Using Excel to Compute the Geometric Mean
=geomean(
select the cells
Using Excel to Compute Percentiles & Quartiles
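Enter =Percentile.Exc(
select the cells, then enter the percentile as a value between 0 and 1 (e.g. 0.8 for the 80th percentile)
Enter =Quartile.Exc(
select the cells, then enter the quartile number (1, 2, or 3)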
Using Excel to Calculate the Sample Variance and Sample Standard Deviation
Enter =var.s(
select the cells
Enter =stdev.s(
select the cells
Using Excel's Descriptive Statistics Tool
Apply Tools:
click on Data in the Ribbon
click on Data Analysis
choose Descriptive Statistics
Using Excel's Recommended Chart Tools to Construct a Histogram (to show a class with no data)
right click any cell in the row labels column
click field settings
click Layout and Print
choose show items with no data; click OK