Lecture Notes on Hypothesis Testing and Data Analysis

Price Inflation Problem

  • Data set: inflation (variable: inflation).
  • Question: Test if price inflation in the United States averaged 3% from 1948 to 2012.
  • Null hypothesis: "mu = 3"
  • Alternative hypothesis: "mu \neq 3"
  • Summary statistics:
    • Average inflation: 3.36
    • Standard deviation: 2.38
  • Test statistic calculation:
    • Formula: t = \frac{\text{sample mean} - \mu_0}{\frac{s}{\sqrt{n}}}
    • Calculation: t = \frac{3.36 - 3}{2.38 / \sqrt{259}} = 2.43
  • P-value calculation:
    • T-distribution with 258 degrees of freedom (n-1).
    • Two-sided test: shade both tails. Because the test statistic is positive, compute the right-tail area and double it.
    • P-value = 2 * P(T > 2.43)
    • Using Stata command: display 2 * ttail(258, 2.43)
    • P-value = 0.0157
  • Decision: Reject the null hypothesis because the p-value (0.0157) is less than 0.05.
  • Conclusion: There is evidence that the average inflation during that time period was not 3%.
  • Stata command: ttest inflation == 3
  • Standard error calculation:
    • Formula: "SE = \frac{s}{\sqrt{n}}"
    • Stata: display 2.38 / sqrt(259)
  • Test statistic via Stata: display (3.36 - 3) / (2.38 / sqrt(259))
  • Degrees of freedom: 258
  • Alternative: mean is not 3 (two-sided test).
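
The hand calculation above can be reproduced from the summary statistics alone. The following is a minimal Stata sketch; the scalar names (xbar, sd, nobs, mu0, se, tstat, pval) are illustrative choices, not part of the data set. Because the inputs are rounded, the result may differ slightly from what ttest reports on the raw data.

```stata
* Summary statistics taken from the notes (rounded values)
scalar xbar = 3.36
scalar sd   = 2.38
scalar nobs = 259
scalar mu0  = 3

* Standard error and test statistic
scalar se    = sd / sqrt(nobs)
scalar tstat = (xbar - mu0) / se

* Two-sided p-value: double the upper-tail area beyond |t|
scalar pval = 2 * ttail(nobs - 1, abs(tstat))

display "SE = " %6.4f se
display "t  = " %6.2f tstat
display "p  = " %6.4f pval

* Same test from summary statistics using the immediate command
ttesti 259 3.36 2.38 3
```

ttesti takes the number of observations, sample mean, standard deviation, and hypothesized mean, so it is a convenient check when only summary statistics are available.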

Earnings Data Set Problem

  • Initial hypothesis: Average earnings are 40,000.
  • Alternative hypothesis: Mu is different from 40,000 (two-sided test).
  • Revised question: Test if population mean earnings exceed 40,000.
  • Stating the question statistically: "mu > 40,000"
  • Opposite: "mu \leq 40,000"
  • Alternative hypothesis: "mu > 40,000" (one-tailed test).
  • Null hypothesis: "mu \leq 40,000"
  • For testing purposes, the null hypothesis is stated as an equality, using the boundary value closest to the alternative (here, "mu = 40,000").
  • Calculate the test statistic and p-value using the earnings data set.
  • Use Stata as a calculator to get used to the operators.
  • Calculators with basic functions (add, subtract, multiply, divide) are sufficient for the exam.

Test Statistic Calculation (Earnings Example)

  • Test statistic formula: t = \frac{\bar{X} - \mu_0}{s / \sqrt{N}}
  • Given: " \bar{X} = 41413, \mu_0 = 40000, s = 25527, N = 171"
  • Calculation: t = \frac{41413 - 40000}{25527 / \sqrt{171}} = 0.72
  • Degrees of freedom: "df = N - 1 = 171 - 1 = 170"
  • P-value: P(T > 0.72), one-tailed test (alternative points to the right).
  • Stata command: display ttail(170, 0.72)
  • P-value ≈ 0.24
  • Decision: Fail to reject H0 at alpha = 0.05 because the p-value (0.24) > 0.05.
  • Conclusion: There is no significant evidence that the average earnings are more than 40,000.
  • Stata command (for the test): ttest earnings == 40000 (read the p-value reported under "Ha: mean > 40000")
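
The one-tailed calculation can be checked with Stata used as a calculator. This is a sketch based on the summary statistics given in the notes; the scalar names are illustrative.

```stata
* Summary statistics from the earnings example
scalar xbar = 41413
scalar sd   = 25527
scalar nobs = 171
scalar mu0  = 40000

scalar tstat = (xbar - mu0) / (sd / sqrt(nobs))

* The alternative is mu > 40000, so the p-value is the upper-tail area
display "t = " %5.2f tstat
display "one-sided p = " %6.4f ttail(nobs - 1, tstat)

* ttesti prints all three alternatives; use the one for Ha: mean > 40000
ttesti 171 41413 25527 40000
```

No doubling is applied here because the alternative points in one direction only.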

IT Companies & Data Analytics

  • IT companies are using data analytics tests to evaluate job applicants.
  • These tests involve providing a dataset and questions to analyze.
  • Companies are increasingly using tests in programs like Excel to verify skills.
  • You need to be able to pick the correct numbers from an output table to succeed.

Reshape Command (Data Transformation)

  • The reshape long command changes the dataset from wide to long form (reshape wide does the reverse).
  • Many analysis and graphing commands expect data in long form.
  • Example: Zillow data on housing prices.
  • Data includes city and housing prices in different time periods.
  • Variable names in Stata cannot start with a digit; they must start with a letter (or an underscore).
  • Add a letter (e.g., 'D') in front of each date so the column name starts with a letter.
  • Reshape long command syntax:
    • reshape long D, i(city) j(year)
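    • i: identifier for the cross-sectional unit (city).
    • j: new variable created to hold the date suffix of each column (year).
    • D is the stub: every column whose name is D followed by a date is stacked into a single variable. Stata is case-sensitive, so the prefix must match the stub exactly ('D', not 'd').
    • After reshaping, D holds the price of the houses.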
  • After reshaping, one may need to merge data for different cities based on years for graphing.
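
The reshape step can be tried on a tiny made-up data set. Everything in this sketch is hypothetical: the city names, the prices, and the column names D2010 to D2012 stand in for the Zillow columns and are not taken from the actual file.

```stata
clear

* Hypothetical wide data: one row per city, one price column per year
input str10 city D2010 D2011 D2012
"Austin" 200 210 225
"Boston" 350 360 380
end

* Stack the D* columns: one row per city-year, year taken from the suffix
reshape long D, i(city) j(year)

* Give the stacked variable a descriptive name
rename D price

list, sepby(city)
```

After the reshape there are three variables (city, year, price) and one row per city-year combination, which is the layout that by-group graphs and merges on year typically expect.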

Review of Key Concepts for Exam

  • Types of Data
    • Quantitative (continuous or discrete)
    • Qualitative/Categorical
  • Types of Observations
    • Cross-sectional
    • Time series
    • Panel
  • Data Presentation
    • Qualitative Data: Pie charts, bar graphs
    • Quantitative Data: Histograms, box and whisker plots
  • Descriptive Statistics
    • Mean: "\bar{x} = \frac{\sum x_i}{n}"
    • Median: Middle value in ordered sequence
    • Mode
    • Mid-range: (Largest + smallest) / 2
  • Shape of Data
    • Symmetric
    • Left skewed
    • Right skewed (skewness determined by the long tail)
  • Skewness Coefficient
    • Greater than 0: Right skewed
    • Less than 0: Left skewed
    • Equal to 0: Symmetric
  • Kurtosis
    • Equal to 3 for normal distribution
  • Range
    • Largest - smallest observation
  • Variance and Standard Deviation
    • Sample variance: Sum of squared deviations from the mean, divided by n-1
    • Standard deviation: Square root of the sample variance
  • Coefficient of Variation (CV)
    • "CV = \frac{s}{\bar{x}}"
  • Interquartile Range (IQR): Not susceptible to outliers
  • Average Absolute Deviation
    • "AAD = \frac{\sum |x_i - \bar{x}|}{n}"
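
Most of the descriptive statistics listed so far can be read from summarize with the detail option, and the rest follow from its stored results. The sketch below uses a small made-up variable x; the values and scalar names are illustrative only.

```stata
clear

* Hypothetical data with one large value to create right skew
input x
2
4
4
5
7
9
25
end

summarize x, detail

* Save stored results before another command overwrites r()
scalar xbar     = r(mean)
scalar midrange = (r(max) + r(min)) / 2
scalar rng      = r(max) - r(min)
scalar cv       = r(sd) / r(mean)
scalar iqr      = r(p75) - r(p25)

display "mean = " xbar "   median = " r(p50)
display "mid-range = " midrange "   range = " rng
display "CV = " cv "   IQR = " iqr
display "skewness = " r(skewness) "   kurtosis = " r(kurtosis)

* Average absolute deviation: mean of |x - xbar|
generate absdev = abs(x - xbar)
summarize absdev, meanonly
display "AAD = " r(mean)
```

With the large value 25 in the data, the mean sits above the median and the skewness coefficient is positive, which matches the right-skew rule in the notes.
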
  • Interpreting Standard Deviation
    • Chebyshev's Theorem: Applies to any data
      • At least 75% of the data are within 2 standard deviations of the mean
      • At least 8/9 of the data are within 3 standard deviations of the mean
      • General rule: at least 1 - \frac{1}{k^2} of the data are within k standard deviations of the mean, for any k > 1
    • Empirical Rule: For bell-shaped and symmetric data
      • About 68% of the data are within 1 standard deviation of the mean
      • About 95% of the data are within 2 standard deviations of the mean
      • About 99.7% of the data are within 3 standard deviations of the mean
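
The two rules can be compared numerically. In this sketch the Chebyshev figure is the guaranteed minimum 1 - 1/k^2, and the second figure is the share of a normal distribution within k standard deviations, computed with Stata's normal() function (the standard normal CDF).

```stata
forvalues k = 1/3 {
    display "k = `k'"
    * Chebyshev: lower bound that holds for any data (uninformative at k = 1)
    display "  Chebyshev lower bound: " %5.3f (1 - 1/`k'^2)
    * Normal distribution: area between -k and +k
    display "  normal distribution:   " %5.3f (normal(`k') - normal(-`k'))
}
```

The normal shares are the source of the 68%, 95%, and 99.7% figures; the Chebyshev bounds are lower because they must hold for every distribution.
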
  • Measures of Relative Standing
    • Percentiles, quartiles, Z-scores
  • Outliers
    • Observations more than 3 standard deviations away from the mean
    • Can be removed after examining their causes
  • Data Transformations: Logs
    • Linearizes exponential growth: if x_t = x_0 (1 + r)^t, then "log(x_t) = log(x_0) + t \cdot log(1 + r) \approx log(x_0) + t \cdot r" for small r
    • Approximate percentage changes: "\Delta log(x) \approx \frac{\Delta x}{x}" for small changes
    • Rule of 72: Approximate number of years to double an investment: "n \approx \frac{72}{r}", with r in percent per year
    • Logs are used to reduce right skewness in data for better analysis
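
The log approximations can be checked with Stata as a calculator. The growth rate of 5% is a hypothetical value chosen for illustration, and g is an illustrative scalar name.

```stata
* Hypothetical growth rate of 5% per year
scalar g = 0.05

* Log growth per period versus the small-rate approximation
display "ln(1 + g) = " %6.4f ln(1 + g) "   g = " %6.4f g

* Doubling time: exact formula versus the Rule of 72 (rate in percent)
display "exact doubling time = " %5.2f (ln(2) / ln(1 + g))
display "Rule of 72          = " %5.2f (72 / (100 * g))
```

At 5% the exact doubling time is about 14.2 years and the Rule of 72 gives 14.4, so the rule is a rough shortcut and not an identity.
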
  • Compound Interest Rates
    • Effective annual interest rate: (1 + r/n)^n - 1, where r is the nominal annual rate and n is the number of compounding periods per year
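
A quick numerical illustration, assuming the usual definition of the effective annual rate, (1 + r/n)^n - 1, with n compounding periods per year. The 6% nominal rate and monthly compounding are hypothetical inputs.

```stata
* Hypothetical: 6% nominal annual rate, compounded monthly
scalar rnom = 0.06
scalar nper = 12

display "effective annual rate = " %7.5f ((1 + rnom/nper)^nper - 1)
```

The effective rate comes out slightly above the nominal 6% because interest earned during the year itself earns interest.
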
  • Economics Data
    • Output measures: GDP, GNP
    • Price indices, price inflation calculation
      • Price Index = (Price of basket in current year/Price of basket in base year) * 100
    • Labor force, employment, unemployment, labor force participation rates
    • Financial data: interest rates, stock indices
    • Real vs nominal data, adjusted for inflation
    • Per capita GDP = GDP / population
    • Growth rates and percentage changes
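
The index and deflation arithmetic can be sketched with made-up numbers. All values below are hypothetical, and the real-value step uses the standard deflation formula (nominal divided by index/100), which the notes mention but do not spell out.

```stata
* Hypothetical basket costs and last year's index
scalar base_cost = 200
scalar cur_cost  = 230
scalar prev_idx  = 110

* Price index = (current basket cost / base-year basket cost) * 100
scalar idx = (cur_cost / base_cost) * 100
display "price index = " idx

* Inflation = percentage change in the index
display "inflation (%) = " %5.2f (100 * (idx - prev_idx) / prev_idx)

* Real value = nominal value / (index / 100)
scalar nominal_gdp = 575
scalar real_gdp    = nominal_gdp / (idx / 100)
display "real GDP = " real_gdp

* Per capita = total / population
scalar pop = 25
display "real GDP per capita = " real_gdp / pop
```
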
  • Sampling Distribution
    • The mean of the sampling distribution is the same as the population mean
    • X-bar becomes a random variable with its mean and standard deviation
    • Standard error: "\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}"
    • If the population is normal, then X-bar will also be normal
    • If the population is not normal but n is large, the Central Limit Theorem (CLT) applies, and X-bar will be approximately normally distributed
    • Confidence interval formula: "\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}"
      • t_{\alpha/2} based on n-1 degrees of freedom
      • Interpretations of confidence intervals must include the level of confidence, bounds, and population parameter definition
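
The interval formula can be evaluated with invttail, which returns the critical value for a given upper-tail area. This sketch reuses the inflation summary statistics from the notes (mean 3.36, standard deviation 2.38, n = 259) and assumes a 95% confidence level; the scalar names are illustrative.

```stata
scalar xbar = 3.36
scalar sd   = 2.38
scalar nobs = 259

* Critical value t_{alpha/2} with n - 1 degrees of freedom (alpha = 0.05)
scalar tcrit = invttail(nobs - 1, 0.025)
scalar se    = sd / sqrt(nobs)

display "t critical = " %6.4f tcrit
display "95% CI: [" %5.3f (xbar - tcrit*se) ", " %5.3f (xbar + tcrit*se) "]"
```

The resulting interval lies entirely above 3, which agrees with rejecting "mu = 3" in the two-sided test at the 5% level.
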
  • Hypothesis Test
    • Identify null and alternative hypotheses
    • Calculate test statistic: "t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}"
    • Calculate p-value, reach decisions, and draw conclusions
    • The exam will include filling in blanks in Stata output, so you must show the calculations and interpretations using the data provided. You will also need to read the candidate ttail commands and pick the one that matches the alternative hypothesis.
  • A calculator is needed for the in-class exam. It should support addition, subtraction, multiplication, division, and square roots.
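
As a study aid for choosing the right ttail expression, the sketch below computes the p-value for each of the three possible alternatives from one test statistic. It reuses t = 0.72 and 170 degrees of freedom from the earnings example; the scalar names are illustrative.

```stata
* Test statistic and degrees of freedom from the earnings example
scalar tstat = 0.72
scalar df    = 170

* ttail(df, t) is the area to the right of t
display "Ha: mean <  mu0   p = " %6.4f (1 - ttail(df, tstat))
display "Ha: mean != mu0   p = " %6.4f (2 * ttail(df, abs(tstat)))
display "Ha: mean >  mu0   p = " %6.4f ttail(df, tstat)
```

These three lines correspond to the three alternatives printed at the bottom of ttest output, so matching the alternative hypothesis to the correct expression is the key step.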