Business Data Management & Acquisition Notes

A distribution shows how the values of a variable are spread across the full range of possible values. It is usually visualized with graphs such as histograms, box plots, or cumulative frequency plots, which make the data set easier to interpret.

A distribution summarizes your data, telling you:

  • What values exist in your data: This helps identify the different outcomes present within the data set.

  • How often each value shows up: Frequency counts provide insights into which values are common and which are rare, which can influence decision-making processes.

  • What is likely to happen in the future: Distributions allow analysts to make predictions based on historical data trends, determining probable future values based on established patterns.

Example: Customer counts over 100 days at a coffee shop can depict how busy the shop is at different times or under varying conditions, thereby assisting in staffing and inventory decisions.
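As a rough sketch of the coffee shop example, the snippet below tallies a frequency distribution from daily customer counts. The numbers are made up for illustration, and only 10 days are used instead of 100 to keep it short; `daily_customers` is a hypothetical name.

```python
from collections import Counter

# Hypothetical daily customer counts (illustrative numbers, not real data).
daily_customers = [42, 45, 45, 50, 42, 45, 60, 50, 45, 42]

# What values exist, and how often each one shows up.
frequency = Counter(daily_customers)

# Relative frequency: the share of days on which each count occurred.
n_days = len(daily_customers)
relative_frequency = {
    value: count / n_days for value, count in sorted(frequency.items())
}

# Print a small text histogram, one '#' per day.
for value, share in relative_frequency.items():
    print(f"{value} customers: {share:.0%} of days  {'#' * frequency[value]}")
```

The relative frequencies sum to 1, which is what lets them stand in for probabilities when estimating what a typical future day might look like.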

Expected Value from a Distribution

Expected value is the weighted average of all possible outcomes in a dataset or distribution. It plays a critical role in decision-making by providing a single summary measure of the potential outcomes.

Formula:

  • With probabilities: $EV = \sum_{i=1}^{n} x_i \times p_i$

  • With frequencies: $EV = \sum_{i=1}^{n} x_i \times \frac{f_i}{N}$, where $f_i$ is the number of times $x_i$ occurs and $N$ is the total number of observations

This gives the average result you would expect based on how often each outcome occurs in the dataset, which makes it useful for comparing the likely payoff of different strategies.
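Both forms of the formula can be sketched in a few lines. The function names and the sample values below are assumptions made for illustration; the two versions give the same answer because each probability is just the frequency divided by the total.

```python
def expected_value_from_probabilities(values, probabilities):
    """EV = sum of x_i * p_i. The probabilities must sum to 1."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probabilities))


def expected_value_from_frequencies(values, frequencies):
    """EV = sum of x_i * (f_i / N), where N is the total count."""
    total = sum(frequencies)
    return sum(x * f / total for x, f in zip(values, frequencies))


# Hypothetical outcomes: customers per day and how many days each occurred.
values = [42, 45, 50, 60]
frequencies = [3, 4, 2, 1]
probabilities = [0.3, 0.4, 0.2, 0.1]

print(expected_value_from_probabilities(values, probabilities))
print(expected_value_from_frequencies(values, frequencies))
```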

Uses of Expected Value
  1. A/B Testing: Comparing the expected value of different versions of an ad is crucial for optimizing marketing strategies.

    • Example: Version A of an ad gets a 2% click rate on TikTok and version B gets a 3% click rate. Revenue per purchase is $150 for A and $70 for B. B wins on clicks but A earns more per purchase, so neither number alone tells us which ad is better.

    • Calculate the $ per view (EV) for both ads (this treats each click as one purchase):

      • EV for version A = 0.02 × $150 = $3.00 per view; EV for version B = 0.03 × $70 = $2.10 per view.

    • The example highlights the importance of considering both click-through rates and revenue per purchase when evaluating ad performance using expected value.
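The ad comparison above can be reproduced with a one-line function. This is a minimal sketch; `ev_per_view` is a hypothetical helper, and it assumes every click turns into exactly one purchase, which is what the figures in the example imply.

```python
def ev_per_view(click_rate, revenue_per_purchase):
    """Expected revenue per view, assuming each click leads to one purchase."""
    return click_rate * revenue_per_purchase


ev_a = ev_per_view(click_rate=0.02, revenue_per_purchase=150)
ev_b = ev_per_view(click_rate=0.03, revenue_per_purchase=70)

print(f"Version A: ${ev_a:.2f} per view")
print(f"Version B: ${ev_b:.2f} per view")
print("Higher EV:", "A" if ev_a > ev_b else "B")
```

Version B has the higher click rate, yet version A has the higher expected value because each purchase is worth more.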

Trusting the Test

How much can we trust these two EVs and the difference? The reliability of the expected values and their difference depends on:

  • Sample Size: The versions could have been tested on as many as 10,000 people or as few as 5. Larger sample sizes generally lead to more reliable results because they better represent the overall population.

  • Variation Within EVs: The more the individual outcomes behind each EV vary, the less stable the EV is, and the harder it becomes to tell a real difference from random noise.
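One way to see the effect of sample size is the standard error of an observed click rate, which shrinks as the number of viewers grows. The sketch below uses the standard formula for the standard error of a proportion, sqrt(p(1 − p) / n); the function name and the sample sizes are chosen for illustration.

```python
from math import sqrt


def click_rate_standard_error(click_rate, n_viewers):
    """Standard error of an observed click rate measured on n_viewers views."""
    return sqrt(click_rate * (1 - click_rate) / n_viewers)


# Same 2% click rate, measured on very different sample sizes.
for n in (5, 100, 10_000):
    se = click_rate_standard_error(0.02, n)
    print(f"n = {n:>6}: standard error = {se:.4f}")
```

With 5 viewers the uncertainty is several times larger than the 2% rate itself, while with 10,000 viewers it is a small fraction of it.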

T-test

A t-test is a statistical test that compares the means of two groups to determine whether the observed difference is likely due to chance or reflects a real effect. It is a fundamental method in inferential statistics that helps validate hypotheses about population parameters.

The outcome of the t-test depends on:

  • The difference in group means

  • The variation within each group, which influences the spread of the data

  • The sample size of each group, affecting the power of the test

The test reports two outputs: the t-statistic and the p-value. Both are described below.

T-statistic

The t-statistic is a signal-to-noise ratio and is calculated using the formula:
$t = \frac{\text{difference between groups' averages}}{\text{how much the data naturally varies}}$

  • The numerator reflects the difference between your groups’ averages, indicating how distinct the groups are.

  • The denominator scales it based on how much the data naturally varies and the size of your samples.

  • Intuition: It provides insight into how significant or pronounced the difference is relative to the expected variability in the data.

  • A larger t-statistic (in absolute value) suggests that the difference between your groups is substantial relative to the natural variation present in the dataset.
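The notes give the t-statistic only in words. One common concrete version is Welch's form, where the denominator is the square root of each group's variance divided by its sample size, summed over the two groups. The sketch below assumes that form; the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(group_a, group_b):
    """Signal-to-noise ratio for the difference between two group means."""
    # Signal: difference between the groups' averages.
    signal = mean(group_a) - mean(group_b)
    # Noise: natural variation, scaled down by each group's sample size.
    noise = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return signal / noise


# Hypothetical revenue per view (in dollars) from two small test groups.
version_a = [3.1, 2.9, 3.4, 2.8, 3.3]
version_b = [2.0, 2.3, 1.9, 2.2, 2.1]

print(f"t = {welch_t_statistic(version_a, version_b):.2f}")
```

Swapping the two groups flips the sign of t but not its size, which is why the absolute value is what matters when judging how pronounced the difference is.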

P-value

The p-value represents the probability of observing a result as extreme as (or more extreme than) your data, assuming the null hypothesis is true. It helps in determining the significance of the results obtained from statistical analysis.

  • Intuitively: It conveys how likely it is to see a result like this (or more extreme) if there were no real effect at play. A smaller p-value indicates it is unlikely to observe such a difference under the null hypothesis.

  • In this class, we consider p <= 0.1 to be ‘statistically significant,’ meaning the data provide enough evidence to reject the null hypothesis. Many fields use a stricter cutoff of 0.05.
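The intuition behind the p-value can be simulated directly. The sketch below is a permutation test, not the t-test itself: it shuffles the group labels many times to mimic a world with no real effect, then counts how often the shuffled difference is at least as large as the observed one. The function name, the data, and the number of shuffles are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Share of label shufflings with a mean difference at least as extreme
    as the observed one (two-sided)."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend group labels are arbitrary (null is true)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / n_permutations


# Hypothetical revenue per view (in dollars) from two small test groups.
version_a = [3.1, 2.9, 3.4, 2.8, 3.3]
version_b = [2.0, 2.3, 1.9, 2.2, 2.1]

p_value = permutation_p_value(version_a, version_b)
print(f"p = {p_value:.4f}")
print("Significant at the class threshold of 0.1:", p_value <= 0.1)
```

A t-test reaches its p-value through the t-distribution rather than by shuffling, so the two numbers will not match exactly, but the interpretation is the same.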

Common Data Issues

In data analysis, several common data issues can impact the accuracy and reliability of results:

  1. Typos & Inconsistencies: Errors due to manual data entry can lead to anomalies in the analysis.

  2. Wrong Data Types: Values stored as the wrong type (for example, numbers or dates stored as text) can break calculations or skew results.

  3. Duplicates & Identifier issues: Repeated entries can inflate counts and mislead conclusions.

  4. Missing Values: Gaps in data lead to incomplete analysis and need to be addressed.

  5. Outliers & Implausible Values: Extreme values can distort statistical measures.
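A minimal sketch of cleaning up the first three issues (inconsistent text, wrong types, duplicates) is shown here. The records and field names are hypothetical, and keeping the first row for each identifier is just one possible rule for duplicates.

```python
# Hypothetical raw order records, all fields read in as text.
raw_rows = [
    {"order_id": "1001", "city": " boston", "amount": "12.50"},
    {"order_id": "1002", "city": "Boston ", "amount": "8.00"},
    {"order_id": "1002", "city": "Boston ", "amount": "8.00"},  # duplicate
    {"order_id": "1003", "city": "BOSTON", "amount": "15.25"},
]

cleaned = []
seen_ids = set()
for row in raw_rows:
    order_id = int(row["order_id"])  # wrong data type: text -> integer
    if order_id in seen_ids:         # duplicate identifier: keep the first row
        continue
    seen_ids.add(order_id)
    cleaned.append(
        {
            "order_id": order_id,
            "city": row["city"].strip().title(),  # fix spacing and casing
            "amount": float(row["amount"]),       # text -> number
        }
    )

print(cleaned)
print("Total revenue:", sum(r["amount"] for r in cleaned))
```

Without the duplicate check, the repeated order would inflate both the order count and the revenue total.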

Missing Values

When dealing with missing data, two scenarios arise:

  1. Actual Missing Values: Points that were not recorded or are blank.

  2. Missing for a Reason: A value may be absent because of something about the record itself, in which case the absence may carry information of its own.

What Can We Do to Retain as Much Information as Possible?
  • Options for Handling Missing Values:

    • Drop the rows where missing data occurs.

    • Fill in missing values with a reasonable default or calculated mean.

    • Turn the variable into a binary variable (indicator) indicating presence or absence.

    • Turn the variable into a categorical variable, allowing for more than two outcomes.
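The four options above might look like this on a small table. This is a sketch with made-up records; the field names are hypothetical, and the mean is used as the fill value only as an example.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
rows = [
    {"customer": "A", "income": 52000, "referral": "friend"},
    {"customer": "B", "income": None, "referral": None},
    {"customer": "C", "income": 61000, "referral": "ad"},
    {"customer": "D", "income": None, "referral": "friend"},
]

# Option 1: drop rows where income is missing.
dropped = [r for r in rows if r["income"] is not None]

# Option 2: fill missing income with the mean of the observed values.
income_mean = mean(r["income"] for r in rows if r["income"] is not None)
filled = [
    {**r, "income": r["income"] if r["income"] is not None else income_mean}
    for r in rows
]

# Option 3: add a binary indicator for whether income was reported.
flagged = [{**r, "income_reported": int(r["income"] is not None)} for r in rows]

# Option 4: treat missing as its own category.
categorized = [{**r, "referral": r["referral"] or "unknown"} for r in rows]

print(len(dropped), "rows left after dropping")
print("Filled incomes:", [r["income"] for r in filled])
print("Indicator:", [r["income_reported"] for r in flagged])
print("Referral categories:", [r["referral"] for r in categorized])
```

Dropping rows discards half of this table, while the other three options keep every row, which is the point of retaining as much information as possible.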

Outliers

Distinguishing real outliers from errors is critical for data integrity. Considerations include:

  • Does the outlier make sense within the context of the data?

  • Is it consistent with other variables in the analysis?

  • Is it isolated, or does it imply a widespread issue?

  • Utilize industry knowledge to assess the validity of the outlier.

Verification and Investigation
  • Verify and investigate potential outliers carefully.

  • Options for handling outliers include:

    • Capping or clipping extreme values.

    • Removing outliers if deemed erroneous.

    • Log-transforming data to shrink the impact of extreme values and make right-skewed distributions more symmetric (this works only for positive values).
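The three handling options can be sketched as follows. The order values and the cutoff of 50 are hypothetical; in practice the cutoff would come from the context and industry knowledge discussed above. `math.log1p` computes log(1 + x), which is used here so that a value of zero would not cause an error.

```python
import math

# Hypothetical order values in dollars, with one extreme entry.
order_values = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 400.0]


def clip(values, lower, upper):
    """Cap every value so it falls between lower and upper."""
    return [min(max(v, lower), upper) for v in values]


# Option 1: cap extreme values at an assumed plausible maximum.
capped = clip(order_values, lower=0.0, upper=50.0)

# Option 2: remove values judged to be errors.
removed = [v for v in order_values if v <= 50.0]

# Option 3: log-transform to compress the gap between typical and extreme.
logged = [math.log1p(v) for v in order_values]

print("Capped: ", capped)
print("Removed:", removed)
print("Logged: ", [round(v, 2) for v in logged])
```

Capping and log-transforming keep all seven rows, while removal drops the extreme one entirely, so removal is best reserved for values confirmed to be errors.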