In-Depth Notes on Modeling, Probability, and Statistical Analysis in Data Analytics

In data analytics, modeling involves representing the relationships between variables in order to gain insights and make predictions.
Modeling helps in understanding how different elements interact and influence each other.

Feedback is a critical part of the data analytics process, emphasizing collaboration and improvement.
Frequent feedback can enhance the quality and effectiveness of analyses.
An iterative process ensures that analyses are continually refined and improved.

Statistics provides methodologies for collecting, analyzing, interpreting, presenting, and organizing data to make informed decisions.

Many processes have random outcomes, leading to unpredictability in results.
- Example: A fair six-sided die has six equally likely outcomes, and the face that comes up on any single roll cannot be predicted.
Probability quantifies the likelihood of specific outcomes occurring, on a scale from 0 (impossible) to 1 (certain):
- For example, the probability of rolling a 1 on a six-sided die is $\frac{1}{6}$, or approximately 16.67%.
Understanding randomness is essential in many contexts, such as:
- College admissions
- Employment rates
- Customer behaviors
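
The die example can be checked by simulation. The sketch below is illustrative only: the function name `estimate_die_probabilities`, the number of rolls, and the seed are assumptions, not part of the original notes. It rolls a simulated fair die many times and compares the observed share of each face with the theoretical $\frac{1}{6}$.

```python
import random
from collections import Counter

def estimate_die_probabilities(num_rolls, seed=0):
    """Estimate the probability of each face by rolling a simulated fair die."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}

estimates = estimate_die_probabilities(60_000)
for face, p in estimates.items():
    print(f"face {face}: estimated {p:.4f} (theoretical {1 / 6:.4f})")
```

With more rolls the estimates tend to settle closer to $\frac{1}{6}$, although any single roll remains unpredictable.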

Statistics helps us draw inferences about populations from samples, because measuring every individual is usually impractical.
For example:
- Estimating the average height of a population or the impact of certain conditions on a specific behavior.
Statistical confidence expresses how sure we can be that conclusions drawn from a sample hold for the larger population.
- E.g., being 97% confident in inventory accuracy based on sample audits.
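
One common way to attach a confidence level to a sample-based estimate is a confidence interval. The sketch below uses the normal-approximation (Wald) interval for a proportion, which is one of several possible methods. The audit figures (188 accurate records out of 200 sampled) and the function name are hypothetical, chosen only to echo the inventory example.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical audit: 188 of 200 sampled inventory records were accurate.
low, high = proportion_confidence_interval(188, 200, confidence=0.95)
print(f"estimated accuracy: {188 / 200:.3f}, 95% interval: ({low:.3f}, {high:.3f})")
```

Raising the `confidence` argument widens the interval, and a larger sample narrows it.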

Measuring the entire population may not be feasible; hence representative sampling is critical.
A representative sample should match the population's characteristics.
- Achieved via random sampling to avoid bias.
Ask the following in your analysis:
- What is the population of interest?
- Does the sample truly represent that population?
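
A small simulation can show why the sampling method matters. Everything in this sketch is made up for illustration: the customer population, the spending figures, and the two channels. It compares a simple random sample with a convenience sample drawn from only one subgroup.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population: 10,000 customers; online shoppers spend more on average.
population = (
    [("online", rng.gauss(80, 15)) for _ in range(3_000)]
    + [("in_store", rng.gauss(50, 15)) for _ in range(7_000)]
)
population_mean = mean(spend for _, spend in population)

# Simple random sample: every customer has the same chance of selection.
random_sample = rng.sample(population, 500)
random_sample_mean = mean(spend for _, spend in random_sample)

# Convenience sample: only online shoppers are surveyed, so it is not representative.
online_only = [row for row in population if row[0] == "online"]
biased_sample = rng.sample(online_only, 500)
biased_sample_mean = mean(spend for _, spend in biased_sample)

print(f"population mean:    {population_mean:.1f}")
print(f"random sample mean: {random_sample_mean:.1f}")
print(f"biased sample mean: {biased_sample_mean:.1f}")
```

The random sample should land near the population mean, while the online-only sample overstates it because it does not match the population's mix of customers.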

Start with a research question (RQ) that proposes a potential relationship or difference.
Statistical tests evaluate the validity of hypotheses.
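A common significance level ($\alpha$) is 0.05, and a result counts as statistically significant when its p-value falls below it. This implies:
- A result this extreme would occur less than 5% of the time if no true relationship existed.
- Accepting a 5% chance of mistakenly concluding that a relationship exists (a Type I error).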
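
One way to see what a p-value measures is a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as large as the observed one. This is a minimal sketch of that idea, not the only way to test a hypothesis. The scores and the function name `permutation_p_value` are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, num_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # relabel observations as if the null hypothesis were true
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (num_permutations + 1)

# Hypothetical scores for two groups.
treated = [78, 85, 90, 88, 84, 91, 87, 83]
control = [72, 75, 80, 70, 78, 74, 77, 73]
p_value = permutation_p_value(treated, control)
alpha = 0.05
print(f"p-value = {p_value:.4f}; significant at alpha = {alpha}: {p_value < alpha}")
```

A p-value below `alpha` says the observed difference would be unusual if the groups did not differ. It does not give the probability that the hypothesis is true.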

Many natural phenomena approximate a normal distribution, or bell curve, characterized by:
- Mean, median, mode being equal in a perfectly normal distribution.
- Most data clustering around the mean, tapering off symmetrically.
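
These properties can be inspected with the `NormalDist` class from Python's standard `statistics` module. The mean of 170 cm and standard deviation of 10 cm below are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical example: heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Probability that a value falls within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

# In a perfectly normal distribution the mean, median, and mode coincide.
print(heights.mean, heights.median, heights.mode)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(heights, k):.4f}")
```

The shares come out to roughly 68%, 95%, and 99.7%, which is the symmetric clustering around the mean described above.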

Variance measures how much data varies from the mean.
Greater variance indicates more spread in data points.
Example: If men's heights are more spread out than women's, estimates for men will be less precise at the same sample size, which affects inferential statistics.
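
As a worked illustration, the population variance is the average squared deviation from the mean, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. The two data sets below are hypothetical. They share the same mean but differ in spread.

```python
from statistics import mean, pstdev, pvariance

# Two hypothetical data sets with the same mean but different spread.
tight = [168, 169, 170, 171, 172]
wide = [150, 160, 170, 180, 190]

for name, data in (("tight", tight), ("wide", wide)):
    print(f"{name}: mean = {mean(data)}, variance = {pvariance(data)}, sd = {pstdev(data):.2f}")
```

Both sets average 170, but the variance of `wide` is 100 times that of `tight`, reflecting how much further its points sit from the mean.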

Statistical power is the probability of correctly rejecting a null hypothesis that is false.
- It increases with sample size and effect size, and decreases with variance.
High power reduces the risk of a Type II error (failing to detect a relationship that actually exists).
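
The way power responds to sample size, effect size, and variance can be sketched with a normal approximation. The function below is a simplified illustration for a two-sided, two-sample comparison with equal group sizes and a known standard deviation. The effect size, standard deviation, and group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)
    critical = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(effect) / standard_error - critical)

for n in (20, 50, 100, 200):
    power = approximate_power(effect=5, sd=15, n_per_group=n)
    print(f"n per group = {n:>3}: power is about {power:.2f}")
```

Power rises as `n_per_group` or `effect` grows and falls as `sd` grows, which matches the relationships listed above.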

T-tests compare means (e.g., average height of men vs. women) to see if significant differences exist.
One-sample tests compare a sample mean to a known value, while two-sample tests compare the means of two different groups.
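
The test statistics behind these comparisons can be computed by hand. The sketch below uses Welch's version of the two-sample test, which does not assume equal variances. That choice is an assumption, since the notes do not specify a variant. The height data are hypothetical. Turning the statistic into a p-value requires the t distribution, which Python's standard library does not provide. A library routine such as `scipy.stats.ttest_ind` with `equal_var=False` would typically be used for that step.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_sq_a = variance(sample_a) / n_a  # variance() uses the n - 1 denominator
    se_sq_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    return t_stat, df

def one_sample_t(sample, known_value):
    """t statistic for comparing a sample mean to a known value (df = n - 1)."""
    return (mean(sample) - known_value) / sqrt(variance(sample) / len(sample))

# Hypothetical heights in centimetres.
men = [178, 182, 175, 180, 185, 177, 181, 179]
women = [165, 168, 162, 170, 166, 163, 169, 167]

t_stat, df = welch_t(men, women)
print(f"two-sample: t = {t_stat:.2f}, df is about {df:.1f}")
print(f"one-sample (men vs. 175 cm): t = {one_sample_t(men, 175):.2f}")
```

A large absolute t value means the difference in means is large relative to the sampling noise.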

Correlation measures relationships between variables, identifying how they move together (positive or negative).
Example: Income may rise with age (positive correlation).
Correlation coefficients range from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear correlation.
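
The most widely used coefficient is Pearson's r, which the sketch below computes directly from its definition. The ages and incomes are hypothetical values that echo the income example.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical ages and annual incomes (in thousands).
ages = [25, 30, 35, 40, 45, 50]
incomes = [32, 41, 45, 58, 60, 71]

print(f"r(age, income) = {pearson_r(ages, incomes):.3f}")
print(f"perfect negative example: {pearson_r([1, 2, 3], [6, 4, 2]):.1f}")
```

Here income rises with age, so r is close to +1. The second call shows a perfectly negative linear relationship.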

Regression models the relationship between one dependent variable and one or more independent variables.
Helps account for the influence of other factors on the dependent variable, allowing for cleaner insights.
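- Model Format: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where $\varepsilon$ is the error term.
Coefficients ($\beta$) indicate the direction and size of the relationship between each variable and the outcome: $\beta_i$ is the expected change in $Y$ for a one-unit increase in $X_i$, holding the other variables constant.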
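
For the simplest case of a single independent variable, the least-squares coefficients have a closed form. The sketch below fits that one-variable model. The data points and the function name `fit_simple_ols` are hypothetical. Models with several independent variables are normally fitted with a statistics library rather than by hand.

```python
from statistics import mean

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for Y = b0 + b1 * X + error."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data lying close to the line Y = 3 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [5.1, 6.9, 9.2, 10.8, 13.0]

b0, b1 = fit_simple_ols(xs, ys)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The fitted slope estimates how much Y changes for a one-unit increase in X.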

Endogeneity occurs when an explanatory variable is correlated with the error term, often complicating causal interpretations.
Recognizing types of endogeneity:
1. Omitted Variables - Missing factors that influence both X and Y.
2. Simultaneity - Bidirectional relationship where X affects Y and vice versa.
3. Selection Bias - Non-random selection affecting the outcome, leading to biased results.
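
The first of these, omitted variable bias, can be demonstrated with simulated data. In the sketch below a confounder Z drives both X and Y, and the true effect of X on Y is set to 2. All values are simulated and hypothetical. Regressing Y on X alone overstates the effect. Controlling for Z recovers it. The control step removes Z from both X and Y and regresses the residuals, which gives the same X coefficient as including Z in the regression.

```python
import random
from statistics import mean

def slope(xs, ys):
    """Least-squares slope from regressing ys on xs (with intercept)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )

def residuals(xs, ys):
    """Residuals from a simple regression of ys on xs (with intercept)."""
    b1 = slope(xs, ys)
    b0 = mean(ys) - b1 * mean(xs)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

rng = random.Random(1)
n = 5_000
z = [rng.gauss(0, 1) for _ in range(n)]  # confounder
x = [zi + rng.gauss(0, 1) for zi in z]   # X depends on Z
# Y depends on both X and Z; the true effect of X is 2.
y = [2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

naive_slope = slope(x, y)                                   # omits Z
controlled_slope = slope(residuals(z, x), residuals(z, y))  # partials Z out

print(f"naive slope (Z omitted):      {naive_slope:.2f}")
print(f"controlled slope (Z removed): {controlled_slope:.2f}")
```

The naive estimate absorbs part of Z's effect because Z is correlated with X and sits in the error term.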

Interactions can show how the effect of one variable changes based on another variable.
Fixed effects help control for unmeasured factors that differ across groups (such as individuals, firms, or regions) but stay constant within each group, even though those factors are not included in the model.
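
One common way to apply fixed effects is the "within" transformation: subtract each group's mean from its observations before fitting the regression. The sketch below uses a hypothetical three-store panel. Each store has its own unobserved baseline, and the slope inside every store is exactly 2.

```python
from collections import defaultdict
from statistics import mean

def slope(xs, ys):
    """Least-squares slope from regressing ys on xs (with intercept)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )

def demean_by_group(groups, values):
    """Subtract each group's mean from its members (the 'within' transformation)."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [v - group_means[g] for g, v in zip(groups, values)]

# Hypothetical panel: stores with higher baselines happen to have lower x values.
stores = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
x = [7, 8, 9, 4, 5, 6, 1, 2, 3]
baseline = {"A": 0, "B": 20, "C": 40}  # unobserved store-level factor
y = [baseline[s] + 2 * xi for s, xi in zip(stores, x)]

pooled_slope = slope(x, y)
within_slope = slope(demean_by_group(stores, x), demean_by_group(stores, y))

print(f"pooled slope (ignores stores): {pooled_slope}")
print(f"within slope (store fixed effects): {within_slope}")
```

The pooled regression even gets the sign wrong because it mixes differences between stores with the relationship inside each store. Demeaning removes the store-level baselines and recovers the slope of 2.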

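Difference-in-differences is a technique that compares changes over time between a treatment group and a control group, and it is particularly useful in policy evaluation.
It controls for time-invariant differences between the groups to isolate the treatment effect, on the assumption that both groups would otherwise have followed parallel trends.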
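
The comparison described above is commonly called difference-in-differences, and in its simplest form it needs only four group averages. The employment rates below are hypothetical.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical average employment rates (percent) before and after a policy.
estimate = difference_in_differences(
    treated_before=60.0, treated_after=66.0,
    control_before=58.0, control_after=61.0,
)
print(f"estimated treatment effect: {estimate:.1f} percentage points")
```

The treated group rose by 6 points and the control group by 3, so the estimate attributes the remaining 3 points to the treatment. The 2-point gap between the groups before the policy drops out of the calculation.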

Ask critical questions about variables and designs, be open to feedback, and continuously refine analyses.
Remember the mantra: correlation does not imply causation!
Utilize regression and interactions thoughtfully to draw deeper insights and make meaningful conclusions.