In-Depth Notes on Modeling, Probability, and Statistical Analysis in Data Analytics
Modeling
- In data analytics, modeling involves representing the relationships between variables in order to gain insights and make predictions.
- Modeling helps in understanding how different elements interact and influence each other.
Feedback
- Feedback is a critical part of the data analytics process, emphasizing collaboration and improvement.
- Frequent feedback can enhance the quality and effectiveness of analyses.
- An iterative process ensures that analyses are continually refined and improved.
Stats Basics
- Statistics provides methodologies for collecting, analyzing, interpreting, presenting, and organizing data to make informed decisions.
Probability Theory
- Many processes have random outcomes, leading to unpredictability in results.
- Example: Rolling a fair six-sided die has six equally likely outcomes. Any probability lies between 0 (impossible) and 1 (certain).
- Probability quantifies the likelihood of specific outcomes occurring:
- For example, the probability of rolling a 1 on a six-sided die is $\frac{1}{6}$ or approximately 16.67%.
- Understanding randomness is essential in many contexts, such as:
- College admissions
- Employment rates
- Customer behaviors
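The die example can be checked numerically. The sketch below assumes a fair die and uses only the Python standard library; the function names `exact_probability` and `simulate_die` are illustrative, not taken from the notes.

```python
import random
from fractions import Fraction

def exact_probability(favorable, total):
    """Probability as favorable outcomes over equally likely outcomes."""
    return Fraction(favorable, total)

def simulate_die(n_rolls, target=1, seed=0):
    """Estimate P(roll == target) as a relative frequency over n_rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls

p_exact = exact_probability(1, 6)   # exactly 1/6
p_est = simulate_die(100_000)       # should land close to 1/6
print(f"exact: {float(p_exact):.4f}, simulated: {p_est:.4f}")
```

With more rolls, the simulated frequency tends to settle closer to the exact value, which is one practical way to think about what a probability means.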
Average Inferences
- Statistics lets us draw inferences about a population from a sample, because measuring every individual is usually impractical.
- For example:
- Estimating the average height of a population or the impact of certain conditions on a specific behavior.
- Statistical confidence expresses how sure we can be that conclusions drawn from a sample hold for the larger population.
- E.g., being 97% confident in inventory accuracy based on sample audits.
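One common way to attach a confidence level to a sample estimate is a confidence interval for the mean. The sketch below uses a normal approximation (for small samples a t-based interval would be somewhat wider); the height values are made up for illustration and `mean_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    center = mean(sample)
    std_error = stdev(sample) / sqrt(n)             # estimated standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return center - z * std_error, center + z * std_error

# Hypothetical sample of heights in centimetres (made-up values).
heights_cm = [162.0, 175.5, 168.2, 181.0, 159.4,
              170.3, 166.8, 177.1, 172.6, 164.9]

low95, high95 = mean_confidence_interval(heights_cm, 0.95)
low97, high97 = mean_confidence_interval(heights_cm, 0.97)
print(f"95% CI: ({low95:.1f}, {high95:.1f})")
print(f"97% CI: ({low97:.1f}, {high97:.1f})")
```

Asking for more confidence (97% instead of 95%) widens the interval: being more certain costs precision unless the sample grows.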
Appropriate Sampling
- Measuring the entire population may not be feasible; hence representative sampling is critical.
- A representative sample should match the population's characteristics.
- This is typically achieved through random sampling, which reduces selection bias.
- Ask the following in your analysis:
- What is the population of interest?
- Does the sample truly represent that population?
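To make the contrast concrete, the sketch below compares a simple random sample with a deliberately biased one. The population is synthetic (heights drawn from a normal distribution with assumed mean 170 and standard deviation 10), so the numbers are illustrative only.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic population: 10,000 heights (assumed mean 170, sd 10).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Simple random sample: every individual is equally likely to be chosen.
random_sample = rng.sample(population, 500)

# Biased sample: only the 500 tallest individuals.
biased_sample = sorted(population)[-500:]

print(f"population mean:    {mean(population):.1f}")
print(f"random sample mean: {mean(random_sample):.1f}")
print(f"biased sample mean: {mean(biased_sample):.1f}")
```

The random sample should track the population mean closely, while the biased sample overstates it by a wide margin, regardless of how many individuals it contains.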
Statistical Hypothesis Tests
- Start with a research question (RQ) that proposes a potential relationship or difference.
- Statistical tests evaluate the validity of hypotheses.
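- A common significance level ($\alpha$) is 0.05: a result is called statistically significant when its p-value falls below 0.05. This means:
- If no true relationship exists, data at least this extreme would occur less than 5% of the time.
- Accepting a 5% chance of mistakenly concluding that a relationship exists when it does not (a Type I error).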
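The meaning of the 0.05 threshold can be illustrated by simulation. The sketch below repeatedly tests a null hypothesis that is actually true, using a two-sided z-test with known standard deviation (a simplifying assumption), and counts how often it is rejected. The function names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with sigma known."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=2000, n=30, alpha=0.05, seed=1):
    """Share of experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # true mean is 0
        if z_test_p_value(sample, mu0=0, sigma=1) < alpha:
            rejections += 1
    return rejections / n_experiments

rate = false_positive_rate()
print(f"rejection rate under a true null: {rate:.3f}")
```

The rejection rate should come out near 0.05, which is the error rate that the threshold accepts by design.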
Normal Distributions
- Many natural phenomena approximate a normal distribution, or bell curve, characterized by:
- Mean, median, and mode being equal in a perfectly normal distribution.
- Most data clustering around the mean, tapering off symmetrically.
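How strongly values cluster around the mean can be quantified with the standard library's `statistics.NormalDist`. The helper `share_within` below is an illustrative name; it returns the share of a normal distribution lying within `k` standard deviations of the mean.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

These shares (roughly 68%, 95%, and 99.7%) do not depend on the particular mean or standard deviation, only on the distribution being normal.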
Variance
- Variance measures how much data varies from the mean.
- Greater variance indicates more spread in data points.
- Example: If men's heights vary more than women's, estimates for men will be less precise at the same sample size, which affects inferential statistics.
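A small worked example may help here. The two made-up data sets below share the same mean but differ in spread; `pvariance` treats the data as a whole population, while `variance` applies the sample correction (dividing by n - 1).

```python
from statistics import mean, pvariance, variance

# Two made-up data sets with the same mean but different spread.
tight = [9.0, 10.0, 10.0, 10.0, 11.0]
wide = [2.0, 6.0, 10.0, 14.0, 18.0]

for name, data in (("tight", tight), ("wide", wide)):
    print(name,
          "mean:", mean(data),
          "population variance:", pvariance(data),
          "sample variance:", variance(data))
```

Both sets average 10, yet the population variance is 0.4 for the first and 32 for the second, so the mean alone says nothing about spread.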
Statistical Power
- Statistical power is the probability of correctly rejecting the null hypothesis when it is in fact false.
- Power increases with sample size and effect size, and decreases with variance.
- High power reduces the risk of a Type II error (incorrectly failing to find a relationship).
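These relationships can be sketched with a power calculation for a two-sided one-sample z-test, assuming the standard deviation is known. This is a simplified setting chosen for illustration; `z_test_power` is a hypothetical helper, and power for t-tests would differ slightly.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect: true difference between the population mean and the null value
    sigma:  population standard deviation (assumed known)
    n:      sample size
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (effect / sigma) * sqrt(n)  # standardized effect scaled by sqrt(n)
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

print(f"n=32, effect=0.5, sigma=1: {z_test_power(0.5, 1.0, 32):.3f}")
print(f"n=64, effect=0.5, sigma=1: {z_test_power(0.5, 1.0, 64):.3f}")
print(f"n=32, effect=0.5, sigma=2: {z_test_power(0.5, 2.0, 32):.3f}")
```

Doubling the sample size raises power, while doubling the standard deviation lowers it, matching the bullets above.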
T-Tests
- T-tests compare means (e.g., average height of men vs. women) to see if significant differences exist.
- One-sample tests compare average values to a known value, while two-sample tests compare means between two different groups.
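The test statistics behind both kinds of t-test can be computed by hand. The sketch below uses Welch's version of the two-sample test, which does not assume equal variances (an assumption on my part, since the notes do not say which variant is meant). Turning the statistic into a p-value requires the t distribution, which the Python standard library does not provide, so that step is left out. The height values are made up.

```python
from math import sqrt
from statistics import mean, variance

def one_sample_t(sample, known_value):
    """t statistic comparing a sample mean to a known value."""
    std_error = sqrt(variance(sample) / len(sample))
    return (mean(sample) - known_value) / std_error

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error, group a
    se_b = variance(sample_b) / n_b  # squared standard error, group b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical heights in centimetres (made-up values).
men = [178.0, 182.5, 171.0, 176.4, 185.2, 174.8]
women = [165.1, 160.3, 168.7, 162.0, 170.2, 158.9]

t_stat, dof = welch_t(men, women)
print(f"two-sample t = {t_stat:.2f}, df = {dof:.1f}")
print(f"one-sample t vs 170 cm = {one_sample_t(men, 170.0):.2f}")
```

A large absolute t statistic means the difference in means is large relative to its standard error; the degrees of freedom determine which t distribution the statistic is compared against.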
Correlation
- Correlation measures relationships between variables, identifying how they move together (positive or negative).
- Example: Income may rise with age (positive correlation).
- Correlation coefficients range from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear correlation.
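As an illustration, the Pearson correlation coefficient can be computed directly from its definition. The age and income figures below are invented for the example, and `pearson_r` is an illustrative name.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    co_movement = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = sqrt(sum((b - my) ** 2 for b in y))
    return co_movement / (spread_x * spread_y)

# Hypothetical ages (years) and incomes (thousands), made-up values.
ages = [22.0, 28.0, 35.0, 41.0, 50.0, 58.0]
incomes = [31.0, 40.0, 52.0, 55.0, 68.0, 66.0]

r = pearson_r(ages, incomes)
print(f"r = {r:.2f}")
```

A value near +1 here would reflect income rising with age in this made-up sample; it says nothing on its own about why the two move together.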
Regression Analysis
- Regression models the relationship between one dependent variable and one or more independent variables.
- Helps account for the influence of other factors on the dependent variable, allowing for cleaner insights.
- Model Format: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where $\varepsilon$ is the error term.
- Coefficients ($\beta$) indicate the direction and size of the relationship between each variable and the outcome, holding the other variables constant.
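For the special case of a single independent variable, the coefficients have a simple closed form under ordinary least squares. The sketch below fits that one-predictor model; `fit_simple_ols` is an illustrative name, and models with several predictors would normally be fitted with a linear algebra or statistics library instead.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for the model y = b0 + b1 * x + error."""
    mx, my = mean(x), mean(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx  # the fitted line passes through the point of means
    return b0, b1

# Made-up data generated exactly from y = 2 + 3x, so the fit is exact.
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]
y_values = [2.0, 5.0, 8.0, 11.0, 14.0]

intercept, slope_estimate = fit_simple_ols(x_values, y_values)
print(f"intercept = {intercept:.2f}, slope = {slope_estimate:.2f}")
```

The slope is read as the expected change in Y for a one-unit increase in X; with real data the points scatter around the line and the error term absorbs the difference.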
Addressing Endogeneity
- Endogeneity occurs when an explanatory variable is correlated with the error term, often complicating causal interpretations.
- Recognizing types of endogeneity:
- Omitted Variables - Missing factors that influence both X and Y.
- Simultaneity - Bidirectional relationship where X affects Y and vice versa.
- Selection Bias - Non-random selection affecting the outcome, leading to biased results.
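Omitted-variable bias, the first type listed, can be demonstrated with simulated data. In the sketch below a hidden factor `z` drives both `x` and `y`; the true effect of `x` on `y` is set to 1.0 by construction. All names and parameter values are assumptions made for the illustration. Controlling for `z` is done by regressing both variables on `z` first and then relating the residuals, which is equivalent to including `z` in the regression.

```python
import random
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def residuals(y, x):
    """Residuals of y after a least-squares fit on x."""
    b1 = slope(x, y)
    b0 = mean(y) - b1 * mean(x)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 20_000
z = [rng.gauss(0, 1) for _ in range(n)]        # omitted factor
x = [zi + rng.gauss(0, 1) for zi in z]         # x depends on z
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1)     # true effect of x is 1.0
     for xi, zi in zip(x, z)]

naive = slope(x, y)                                      # ignores z
controlled = slope(residuals(x, z), residuals(y, z))     # accounts for z
print(f"naive slope: {naive:.2f}, controlled slope: {controlled:.2f}")
```

Leaving `z` out should roughly double the estimated effect in this setup, while accounting for it should recover a value near the true 1.0.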
Interactions and Fixed Effects
- Interactions can show how the effect of one variable changes based on another variable.
- Fixed effects help control for unmeasured factors that differ across groups (such as individuals, firms, or regions) but stay constant within each group, so they do not need to be measured directly.
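One way to apply fixed effects is the within transformation: subtract each group's mean from its observations and then fit the regression on the demeaned data. The sketch below covers only fixed effects, not interactions, and uses made-up data in which two groups share a slope of 2 but have different baselines.

```python
from collections import defaultdict
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def demean_by_group(values, groups):
    """Subtract each group's mean from its members (within transformation)."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    group_mean = {g: mean(vs) for g, vs in by_group.items()}
    return [v - group_mean[g] for v, g in zip(values, groups)]

# Made-up data: group A follows y = 10 + 2x, group B follows y = 0 + 2x.
groups = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.0, 14.0, 16.0, 8.0, 10.0, 12.0]

pooled = slope(x, y)  # ignores group membership
within = slope(demean_by_group(x, groups), demean_by_group(y, groups))
print(f"pooled slope: {pooled:.2f}, within-group slope: {within:.2f}")
```

The pooled slope comes out negative because the group baselines are mixed into the estimate, while the within-group slope recovers the shared value of 2.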
Difference-in-Difference
- A technique to compare changes over time between a treatment and control group, particularly useful in policy evaluation.
- Controls for time-invariant differences between groups to isolate the treatment effect, assuming both groups would have followed parallel trends in the absence of treatment.
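In its simplest two-group, two-period form, the estimate is the change in the treatment group minus the change in the control group. The sketch below uses invented outcome values, and `diff_in_diff` is an illustrative name.

```python
from statistics import mean

def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Two-group, two-period difference-in-difference estimate."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    # The control group's change stands in for what would have
    # happened to the treated group without the treatment.
    return treated_change - control_change

# Made-up outcomes before and after a hypothetical policy change.
treat_pre, treat_post = [10.0, 12.0, 11.0], [15.0, 17.0, 16.0]
control_pre, control_post = [8.0, 9.0, 10.0], [10.0, 11.0, 12.0]

effect = diff_in_diff(treat_pre, treat_post, control_pre, control_post)
print(f"estimated treatment effect: {effect:.1f}")
```

Here the treated group rises by 5 and the control group by 2, so the estimated effect is 3. The different starting levels of the two groups (11 versus 9) drop out of the calculation.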
Conclusion: Keys to Effective Data Analysis
- Ask critical questions about variables and designs, be open to feedback, and continuously refine analyses.
- Remember the mantra: correlation does not imply causation!
- Utilize regression and interactions thoughtfully to draw deeper insights and make meaningful conclusions.