Response Variable
What we would like to predict
Explanatory Variable
Variables used to calculate predictions
Example of Response Variable
Amount spent by online customers
Example of Explanatory Variables
Number of employees, type of industry, etc.
Rows are
Rows are Horizontal
Columns are
Columns are vertical
Cases are in
Cases are in rows
Variables are in
Variables are in columns
Quantitative Variable
Tells us how much of something was measured and quantifies exactly how far apart individual items are
Examples of Quantitative Variables
Height, weight, salary, score, distance, time, GPA
Categorical Variable
Places items into distinct categories; it can’t specify exactly how far apart two items are, and arithmetic such as computing an average is not meaningful
Examples of Categorical Variables
Gender, race, nationality, hair color, student ID, class grade, zip code
Identifier
A unique code assigned to each individual or item, listed in the first column of the data table
Examples of Identifiers
Social security, student ID numbers, transaction numbers
Time Series Data
Data that consists of the same item measured repeatedly over time
Example of Time Series Data
The price of Bitcoin at the end of each day for a year, Monthly Inflation Rate
To qualify as time series, we should
To qualify as time series, we should be able to plot the data as a line with time on the X-axis
Cross Sectional Data
Data on many items, each measured only once at a single point in time
Examples of Cross Sectional Data
A household income study for one year, health snapshots
What is EDA?
Exploratory data analysis: examines data for patterns, underlying structure, trends, deviations from the trend, etc.
To Display Categorical Data in R use:
Bar Charts or Pie Charts
Bar Charts
Shows the counts for each category
How to make a Bar Chart:
barplot(table(your_variable), main = "Your Title", xlab = "Your Label")
Pie Charts
Pie charts should be used when the focus is on percentages rather than actual counts (e.g., market share)
How to make a Pie Chart:
pie(table(your_variable), main = "Your Title")
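A minimal sketch of both charts, assuming a made-up categorical vector called `industry` (the name and the values are hypothetical, chosen only for illustration):

```r
# Hypothetical categorical data: industry type for eight customers
industry <- c("Retail", "Tech", "Retail", "Finance",
              "Tech", "Retail", "Finance", "Retail")

counts <- table(industry)       # number of cases in each category
shares <- prop.table(counts)    # proportion of cases in each category

# Bar chart: focus on the counts
barplot(counts, main = "Customers by Industry", xlab = "Industry")

# Pie chart: focus on each category's share of the whole
pie(counts, main = "Share of Customers by Industry")
```

`table()` does the counting, so both plotting functions receive counts rather than the raw labels.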
To display Quantitative Variables
To display Quantitative Variables, use a Histogram or Boxplot
How to make a Histogram
hist(your_variable, main = "Your Title", xlab = "Your Label")
How to make a Boxplot
boxplot(your_variable, main = "Your Title", xlab = "Your Label")
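A short sketch of both displays, assuming a made-up vector of order amounts (`amount` and its values are hypothetical). The boxplot is drawn horizontally here so that `xlab` labels the measured values; that is a choice for this sketch, not a requirement:

```r
# Hypothetical quantitative data: order amounts in dollars
amount <- c(12, 15, 18, 20, 21, 22, 25, 27, 30, 95)

# Histogram: shows the shape of the distribution
hist(amount, main = "Order Amounts", xlab = "Amount ($)")

# Boxplot: shows the five-number summary and flags outliers
boxplot(amount, horizontal = TRUE,
        main = "Order Amounts", xlab = "Amount ($)")

# Values the boxplot would draw as separate outlier points
outliers <- boxplot.stats(amount)$out
```

In this made-up data the value 95 sits far from the rest, so the boxplot draws it as a separate point.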
Modes
Peaks or humps seen in a histogram are called the modes
Unimodal
A distribution whose histogram has one main peak
Bimodal
A distribution whose histogram has two main peaks
Multimodal
A distribution whose histogram has three or more peaks
Uniform Histogram
All the bars are approximately the same height; there is no mode
Symmetric
A distribution is symmetric if the halves on either side of the center look approximately like mirror images
Skewness
If one tail stretches out longer than the other, the distribution is said to be skewed to the side of the longer tail
Mean
The average, used to measure the typical value for unimodal, symmetric distributions
Median
If the data set is skewed or contains outliers, it is better to use the median as a measure of the “typical value”
Consider the following salaries:
56, 46, 48, 60, 150. Which is the best measure of the typical salary?
The median of 56
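The salary example can be checked directly in R. This sketch uses the five salaries from the card; the variable name `salaries` is just a label chosen here:

```r
salaries <- c(56, 46, 48, 60, 150)

mean(salaries)    # 72: pulled upward by the single large salary of 150
median(salaries)  # 56: the middle value after sorting
```

Four of the five salaries are below the mean of 72, which is why the median is the better description of a typical salary here.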
If a distribution is roughly symmetric, the ___ and ____
If a distribution is roughly symmetric, the mean and median will be reasonably similar
But in a skewed distribution
But in a skewed distribution, the mean typically gets pulled towards the longer tail
The more ___ the values the…
The more spread out the values, the bigger the prediction errors and the less accurate the statistical models
The two main measures of spread are:
Standard deviation and IQR
Standard Deviation
A measure of the average distance of points from the mean or center
The more spread out the points the farther the…
The more spread out the points, the farther the average distance from the mean and the greater the standard deviation
If all points are close to the center
If all points are close to the center, the standard deviation will be small
Standard deviation is very sensitive to
Outliers
IQR
The IQR, Q3 - Q1, measures the spread of the middle 50% of the data
When should mean and standard deviation be used?
The mean and standard deviation should be used when the shape is unimodal, symmetric, and no outliers are present
When should the Median and IQR be used?
The median and IQR should be used if the shape is skewed, or if there are outliers
If ALL the data points increase/decrease by a constant value
The mean and median will increase/decrease. The IQR and SD will stay the same
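A quick check of this rule, assuming a made-up data vector and an arbitrary shift of 10 (both are illustrative choices):

```r
x <- c(56, 46, 48, 60, 150)
shifted <- x + 10             # add the same constant to every data point

# Measures of center move by the constant
mean(shifted) - mean(x)       # 10
median(shifted) - median(x)   # 10

# Measures of spread do not change
same_sd  <- isTRUE(all.equal(sd(shifted), sd(x)))
same_iqr <- isTRUE(all.equal(IQR(shifted), IQR(x)))
```

Both `same_sd` and `same_iqr` come out `TRUE`, since shifting every point slides the whole distribution without stretching it.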
If SOME of the data points increase/decrease
The median and IQR stay the same as long as none of the points cross the median, Q1 or Q3
Five Number Summary
Min, Q1, Median, Q3, Max
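A minimal sketch of getting the five-number summary in R, assuming a small made-up vector. Note that `quantile()` uses R's default interpolation method, which can give slightly different quartiles from a hand calculation on other data sets:

```r
x <- c(56, 46, 48, 60, 150)

five <- quantile(x)   # 0%, 25%, 50%, 75%, 100% = Min, Q1, Median, Q3, Max
five                  # 46 48 56 60 150

iqr <- IQR(x)         # Q3 - Q1 = 60 - 48 = 12
```

`summary(x)` reports the same five numbers plus the mean.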
Boxplot: If the median is exactly halfway between Q1 and Q3 the data is
If the median is exactly halfway between Q1 and Q3, the data is roughly symmetric (at least in its middle half)
Boxplot: If the median isn’t exactly halfway between Q1 and Q3
The data is skewed in the direction of the longer distance
When the whiskers have different lengths, the longer whisker also indicates the direction of the
When the whiskers have different lengths, the longer whisker also indicates the direction of the skewness
Parallel Boxplots provide a good method of…
Parallel boxplots provide a good method of comparing a quantitative variable across different categories of another variable
tapply function
tapply(variable to be analyzed, grouping variable, function)
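A small sketch of `tapply()` alongside parallel boxplots, assuming made-up spending figures for two hypothetical regions:

```r
# Hypothetical data: amount spent and the region of each customer
spend  <- c(10, 20, 30, 40, 50, 60)
region <- c("East", "East", "East", "West", "West", "West")

# Apply mean() to spend separately within each region
group_means <- tapply(spend, region, mean)
group_means   # East 20, West 50

# Parallel boxplots: one box per region on a shared scale
boxplot(spend ~ region, main = "Spend by Region",
        xlab = "Region", ylab = "Spend ($)")
```

Any summary function can replace `mean` in the third argument, for example `median`, `sd`, or `IQR`.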
Z-Score
Tells us how many standard deviations a data point is from its mean
Formula for Z score
z = (value - mean) / SD
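A worked z-score, assuming a hypothetical test score of 85 from a distribution with mean 70 and SD 10 (all three numbers are made up). The subtraction has to happen before the division, so the parentheses matter:

```r
value <- 85
mu    <- 70   # assumed mean
sigma <- 10   # assumed standard deviation

z <- (value - mu) / sigma
z   # 1.5: the score is 1.5 standard deviations above the mean

# For a whole data vector, scale() standardizes every value at once
scores   <- c(60, 70, 80)
z_scores <- as.numeric(scale(scores))   # -1, 0, 1
```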
Time Series Plot
A graph of a time series data set, a special type of line graph in which the X-axis is time
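One way to draw a time series plot in base R, assuming twelve made-up monthly values (the numbers are illustrative, not real data):

```r
# Hypothetical monthly inflation rates (%) for one year
month <- 1:12
rate  <- c(2.1, 2.3, 2.2, 2.6, 2.8, 3.0, 3.1, 2.9, 2.7, 2.5, 2.4, 2.2)

# type = "l" joins consecutive points with a line; time goes on the X-axis
plot(month, rate, type = "l",
     main = "Monthly Inflation Rate",
     xlab = "Month", ylab = "Rate (%)")
```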
Probability
A useful tool to quantify uncertainty, providing an objective rationale for decision-making
Example of Probability in Business Sales
Sales: determine which factors increase the probability of making a sale
Example of Probability in Business Accounting:
Accounting: Identify scenarios most likely to involve fraud to efficiently allocate investigative resources
Example of Probability in Risk Management:
Risk Management: Calculate the probability of various disruptions and how severely they would affect the company
Probability can be interpreted as
Probability can be interpreted as the long-run relative frequency with which an event occurs
Basic Probability Rules: #1
Probability is a number between 0 and 1
Basic Probability Rules: #2
The probabilities of all possible outcomes sum to 1
Basic Probability Rules: #3
The Complement rule: P(A) = 1 - P(A^c)
Basic Probability Rules: #4
The Addition rule: P(A or B) = P(A) + P(B), provided A and B are disjoint (mutually exclusive)
Basic Probability Rules: #5
The General Addition rule: P(A or B) = P(A) + P(B) - P(A intersection B)
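A numeric sketch of the complement and general addition rules, assuming made-up probabilities for two events A and B:

```r
p_a       <- 0.5   # assumed P(A)
p_b       <- 0.3   # assumed P(B)
p_a_and_b <- 0.1   # assumed P(A and B), the overlap

# Complement rule: probability that A does not occur
p_not_a <- 1 - p_a                    # 0.5

# General addition rule: subtract the overlap so it is not counted twice
p_a_or_b <- p_a + p_b - p_a_and_b     # 0.7
```

If A and B could not happen together, `p_a_and_b` would be 0 and the general rule would reduce to simply adding the two probabilities.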
Conditional Probability
Conditional probability is the probability of one event (A), given that another event (B) is known to have occurred
Conditional Probability Formula
P(A|B) = P(A ∩ B) / P(B)
Example of Conditional Probability Scenario
Suppose you run an online retail business. You want to understand how likely a customer is to make a purchase given that they have added items to their shopping cart
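The shopping-cart scenario can be sketched with numbers. The visitor counts below are made up purely for illustration:

```r
# Hypothetical counts from a made-up day of website traffic
visitors          <- 1000
added_to_cart     <- 200   # event B: added items to the cart
cart_and_purchase <- 50    # event A and B: added to the cart and purchased

p_b       <- added_to_cart / visitors       # P(B)       = 0.20
p_a_and_b <- cart_and_purchase / visitors   # P(A and B) = 0.05

# P(A|B) = P(A and B) / P(B)
p_purchase_given_cart <- p_a_and_b / p_b    # 0.25
```

Under these assumed counts, a quarter of the customers who add items to the cart go on to purchase.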
Independent Events
Events are said to be independent if the occurrence of one event has no effect on the probability of the other
Multiplication Rule
The Multiplication Rule says that A and B are independent if P(A and B) = P(A) × P(B)
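A minimal check of independence, assuming made-up probabilities. `all.equal()` is used instead of `==` because decimal arithmetic on a computer is not always exact:

```r
p_a       <- 0.4   # assumed P(A)
p_b       <- 0.5   # assumed P(B)
p_a_and_b <- 0.2   # assumed P(A and B)

# Independent when the joint probability equals the product
independent <- isTRUE(all.equal(p_a_and_b, p_a * p_b))
independent   # TRUE, since 0.4 * 0.5 = 0.2
```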
Random Variables
A random variable assigns a numerical value to each outcome of a random process (one not known with certainty); its probability distribution specifies how likely each value is
Example of Random Variables
An inflation rate in 5 years’ time, the monthly sales of a particular cellphone in 2 years’ time
Discrete Variables
A discrete variable is a numerical variable that takes only specific, countable values
Discrete Random Variable
A discrete random variable is a variable that counts outcomes of a random process and can take only specific, separate values
Continuous Variable
A continuous variable is a numerical variable that can take infinitely many possible values within a given interval
Normal Distribution
A normal distribution is a bell-shaped, symmetric distribution where most values cluster around the average, and fewer values occur as you move away from the center
Normal Distributions follow the
Normal distributions follow the 68-95-99.7 Rule
68-95-99.7 Rule
68% of the values fall within 1 SD of the mean
95% of the values fall within 2 SDs of the mean
99.7% of the values fall within 3 SDs of the mean
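The three percentages can be reproduced with `pnorm()`. This sketch uses the standard normal (mean 0, SD 1); `within_k_sd` is a helper name made up for this example:

```r
# Probability of falling within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

rule <- round(within_k_sd(1:3), 4)
rule   # 0.6827 0.9545 0.9973
```

The rule's 68, 95, and 99.7 are rounded versions of these values.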
Normal Probabilities in R
pnorm(value, mean, SD) gives the lower probability by default
How do we get the upper probability of a Normal Distribution Model
To get the upper probability, we can either subtract the answer from 1 or include the option lower.tail = FALSE
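Both routes to the upper probability give the same answer. This sketch assumes a made-up normal model with mean 100 and SD 15:

```r
mu    <- 100   # assumed mean
sigma <- 15    # assumed standard deviation

lower <- pnorm(130, mu, sigma)   # P(X <= 130), about 0.977

# Two equivalent ways to get P(X > 130), about 0.023
upper_a <- 1 - pnorm(130, mu, sigma)
upper_b <- pnorm(130, mu, sigma, lower.tail = FALSE)
```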
Normal Distribution Cutoff Values
A cutoff value is the value of a variable corresponding to a specified percentile or probability in a normal distribution
Example of Normal Distribution Cutoff Values
Finding top 10% or bottom 5% of performers
Cutoff Values in R
qnorm(left tail probability, mean, SD)
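A sketch of finding cutoffs for the top 10% and bottom 5%, again assuming a made-up normal model with mean 100 and SD 15. Because `qnorm()` expects the left-tail probability, the top 10% cutoff is the 90th percentile:

```r
mu    <- 100   # assumed mean
sigma <- 15    # assumed standard deviation

top_10_cutoff   <- qnorm(0.90, mu, sigma)   # about 119.2
bottom_5_cutoff <- qnorm(0.05, mu, sigma)   # about 75.3

# pnorm() undoes qnorm(): the cutoff maps back to its probability
check <- pnorm(top_10_cutoff, mu, sigma)    # 0.90
```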
Expected Value
The expected value, E(X), of a random variable X is its long-run mean: the average of all possible values of X, each weighted by its probability
The Standard Deviation of a Random Variable
The standard deviation of a random variable is the square root of its variance; it measures the typical deviation from the mean, with each deviation weighted by its probability of occurring
Variance of a Random Variable
The variance of a random variable is the average squared deviation from the mean, with each squared deviation weighted by its probability of occurring
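A worked example for a discrete random variable, assuming a made-up distribution for the number of sales per day (the values and probabilities are hypothetical):

```r
x <- c(0, 1, 2)         # possible values of X
p <- c(0.2, 0.5, 0.3)   # their probabilities (must sum to 1)

ev <- sum(x * p)                # expected value: 0(0.2) + 1(0.5) + 2(0.3) = 1.1
variance <- sum((x - ev)^2 * p) # probability-weighted squared deviations = 0.49
std_dev  <- sqrt(variance)      # 0.7
```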
Law of Large Numbers
The law of large numbers states that as the sample size increases, the sample mean converges to the population mean. Thus, larger samples tend to produce results close to the population value (this becomes increasingly likely, but is not guaranteed for any single sample), while smaller samples might have a mean that is considerably different
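A small simulation can illustrate the idea. This sketch assumes a made-up normal population with mean 50 and SD 10; the seed is fixed only so that the run is repeatable:

```r
set.seed(1)   # arbitrary seed for reproducibility
draws <- rnorm(100000, mean = 50, sd = 10)

# Sample means from progressively larger samples
mean_10     <- mean(draws[1:10])
mean_1000   <- mean(draws[1:1000])
mean_100000 <- mean(draws)

c(mean_10, mean_1000, mean_100000)
```

The largest sample's mean should land very close to 50, while the mean of only 10 draws can miss by a noticeable amount.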
Empirical Distribution
An empirical distribution is the distribution of a dataset based on observed values and their frequencies or proportions
Probability Distribution Basis
Theoretical Model
Probability Distribution Information
Provides complete probability information regarding all outcomes
Probability Distribution Constant?
Yes: based on theoretical assumptions
Empirical Distribution Basis
Observed data from past observations
Empirical Distribution Information
Based on previously observed data. Future outcomes could differ from the past
Empirical Distribution Constant?
No: will change when collecting different data sets
Covariance
Measures the degree to which two random variables move in the same or opposite directions
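A minimal sketch of the sign of the covariance, using made-up vectors and R's `cov()` function (which computes the sample covariance):

```r
x      <- 1:5
y_up   <- c(2, 4, 6, 8, 10)   # rises as x rises
y_down <- c(10, 8, 6, 4, 2)   # falls as x rises

cov_up   <- cov(x, y_up)     #  5: the variables move in the same direction
cov_down <- cov(x, y_down)   # -5: the variables move in opposite directions
```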