Lecture Notes on Statistical Analysis and Research Methods

Categorical Variables

  • Definition: Represent distinct groups or categories.

  • Examples:

    • Gender

    • Types of products

    • Education level

  • Measurement Scale: Nominal, Ordinal

  • Analysis Method: Frequency analysis

Continuous Variables

  • Definition: Measured on a continuous numerical scale.

  • Examples:

    • Height

    • Weight

    • Temperature

    • Income

  • Measurement Scale: Interval, Ratio

  • Analysis Method: Descriptive statistics

Univariate Exploratory Data Analysis (EDA)

  • Components:

    • Frequency or descriptive statistics

    • Data visualizations

  • Goals:

    1. Describe data

    2. Understand the distribution

    3. Identify outliers
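
  • Example (Python sketch): a minimal univariate EDA on a small hypothetical data set, combining frequency analysis, descriptive statistics, and a histogram.

    # Univariate EDA: frequencies for a categorical variable, descriptive
    # statistics and a histogram for a continuous variable.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "education": ["HS", "BA", "BA", "MA", "HS", "BA"],     # categorical
        "income": [42000, 55000, 61000, 72000, 39000, 58000],  # continuous
    })

    print(df["education"].value_counts())   # frequency analysis
    print(df["income"].describe())          # mean, std, quartiles, min/max

    df["income"].plot(kind="hist", bins=5)  # distribution shape and outliers
    plt.xlabel("Income")
    plt.show()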

Bivariate Analysis

  • Definition: Examines the relationship between two variables

  • Methods for Two Categorical Variables:

    • Summarization Method: Cross-tabulation

    • Hypothesis Testing Method: Chi-square test of independence
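
  • Example (Python sketch): cross-tabulation plus a chi-square test of independence on small hypothetical data.

    # Two categorical variables: summarize with a cross-tab, test with chi-square.
    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.DataFrame({
        "gender":  ["F", "F", "F", "F", "M", "M", "M", "M"],
        "product": ["A", "A", "A", "B", "B", "B", "A", "B"],
    })

    table = pd.crosstab(df["gender"], df["product"])   # summarization step
    chi2, p, dof, expected = chi2_contingency(table)   # hypothesis test
    print(table)
    print(f"chi2 = {chi2:.3f}, p = {p:.3f}")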

Bivariate Analysis (One Continuous & One Categorical)

  • Summarization Method: Comparing means between groups

  • Hypothesis Testing Methods:

    • Independent samples t-test

    • ANOVA
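
  • Example (Python sketch): comparing a continuous outcome across groups with an independent samples t-test (two groups) and a one-way ANOVA (three groups); the group data are hypothetical.

    # One continuous and one categorical variable: t-test and one-way ANOVA.
    from scipy import stats

    members = [4, 6, 5, 7, 6, 5]
    non_members = [2, 3, 4, 3, 2, 4]
    t_stat, p_val = stats.ttest_ind(members, non_members)   # two groups
    print(f"t = {t_stat:.2f}, p = {p_val:.3f}")

    group_a, group_b, group_c = [4, 5, 6], [3, 4, 4], [6, 7, 8]
    f_stat, p_val = stats.f_oneway(group_a, group_b, group_c)  # three groups
    print(f"F = {f_stat:.2f}, p = {p_val:.3f}")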

Correlation

  • Definition: Correlation $(\rho)$ measures the strength of the linear relationship between two continuous variables.

  • Range of Values: Always between -1 and 1
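
  • Example (Python sketch): a Pearson correlation between two hypothetical continuous variables.

    # Correlation between two continuous variables (always between -1 and 1).
    from scipy import stats

    height = [160, 165, 170, 175, 180, 185]
    weight = [55, 60, 66, 70, 77, 82]

    r, p = stats.pearsonr(height, weight)
    print(f"r = {r:.3f}, p = {p:.4f}")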

Positive Correlation

  • If the correlation is greater than 0, the two variables tend to move in the same direction: as one increases, the other tends to increase.

Choosing a Statistical Analysis Method: Examples

  1. Therapy pool usage (categorical) and visit frequency (continuous): Independent samples t-test

  2. Visit frequency and membership fees (both continuous): Correlation coefficient

  3. Median-split age and therapy pool usage (both categorical): Chi-square test

Correlation vs. Regression

  • Correlation: Measures strength of linear relationship between two continuous variables.

    • Correlation coefficient ranges from -1 to 1

  • Regression: Aims to establish a linear equation between variables.

    • Involves estimating both intercept and slope parameters.

    • Independent variables do not need to be continuous.

    • A linear equation can predict the outcome variable; correlation cannot be used for prediction.

Regression Analysis

  • Uses:

    • To assess if there's a statistically significant relationship between variables.

    • To measure the effect of one variable on another.

    • To predict the outcome variable.

    • To control for other variables when making assessments.

  • Applications:

    • Understanding how predictor variables influence dependent variables.

    • Predicting/forecasting dependent variables based on predictor variables.

Examples in Regression Analysis

  • Price and demand data

  • AFC data

Steps in Regression Analysis

  1. Plot scatter diagram.

  2. Formulate general model.

  3. Estimate parameters.

  4. Estimate standardized regression coefficients.

  5. Test for significance.

  6. Determine strength and significance of association.

  7. Examine residuals.

Evaluating Regression Significance

F-Test for Model Validity

  • Questions:

    • Is the overall regression model significant?

    • Is there evidence that at least one of the estimated coefficients differs from zero?

  • Null Hypothesis ($H_0$): $\beta_{intercept} = \beta_{price} = 0$

  • Alternative Hypothesis ($H_a$): At least one of $\beta_{intercept}$ or $\beta_{price}$ is not zero.

  • Test Statistic Formula: $F = \frac{MS_{regression}}{MS_{residual}}$, where:

    • $MS_{regression} = SS_{regression} / df_{regression}$ (regression sum of squares divided by its degrees of freedom).

    • $MS_{residual} = SS_{residual} / df_{residual}$ (residual sum of squares, also called SSE, divided by its degrees of freedom).

  • Decision Rule: Reject $H_0$ if $p < 0.05$.
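
  • Worked example (Python sketch): a minimal computation of the F statistic from the regression and residual sums of squares reported in the Analysis of Variance table later in these notes.

    # F-test sketch using the sums of squares and degrees of freedom
    # from the ANOVA table below (Model and Error rows).
    from scipy import stats

    ss_regression, df_regression = 75.405658, 1
    ss_residual, df_residual = 11.469342, 30

    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    f_stat = ms_regression / ms_residual                    # ~197.24, matches the F Ratio
    p_val = stats.f.sf(f_stat, df_regression, df_residual)  # upper-tail probability
    print(f"F = {f_stat:.2f}, p = {p_val:.2e}")             # p < 0.0001 -> reject H0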

t-Test for Price Coefficient Validity

  • Questions:

    • Does a specific predictor have a significant effect on the dependent variable?

    • Example of Demand Relationship: $Demand_i = a + b_1 \cdot Price_i + e_i$, where $i$ refers to the observation.

  • Two-tailed Test:

    • $H_0$: $\beta_{price} = 0$

    • $H_a$: $\beta_{price} \neq 0$

  • One-tailed Test:

    • $H_0$: $\beta_{price} = 0$

    • $H_a$: $\beta_{price} < 0$

  • Test Statistic Formula:
    $t = \frac{b_{price}}{SE_{price}}$, where $b_{price}$ is the estimated coefficient and $SE_{price}$ is its standard error.

  • Decision Rule: Reject $H_0$ if $p < 0.05$.
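
  • Worked example (Python sketch): the t statistic as the estimated coefficient divided by its standard error; the values come from the parameter-estimates table later in these notes.

    # t-test sketch for the price coefficient (estimate / standard error).
    from scipy import stats

    b_price, se_price, df_residual = -0.895554, 0.063767, 30

    t_stat = b_price / se_price                       # ~ -14.04, matches the t Ratio
    p_two_tailed = 2 * stats.t.sf(abs(t_stat), df_residual)
    p_one_tailed = stats.t.cdf(t_stat, df_residual)   # for H_a: beta_price < 0
    print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tailed:.2e}")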

Predicting Demand and Profit with Linear Fit

  • Equation:
    Demand = 10.127938 - 0.8955541 * Price

  • Summary of Fit:

    • R-Squared: $0.867979$

    • Adjusted R-Squared: $0.863578$

    • Root Mean Square Error: $0.618313$

    • Mean of Response: $6.1875$

    • Observations: $32$
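
  • Example (Python sketch): using the reported fitted equation to predict demand at a few candidate prices (the prices are chosen for illustration only).

    # Predicted demand from the estimated linear fit.
    def predicted_demand(price):
        return 10.127938 - 0.8955541 * price

    for price in (3.0, 4.0, 5.0, 6.0):
        print(f"price = {price:.2f} -> predicted demand = {predicted_demand(price):.2f}")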

Analysis of Variance

  • Analysis of Variance Breakdown:

    Source    DF    Sum of Squares    Mean Square    F Ratio
    Model      1         75.405658        75.4057    197.2362
    Error     30         11.469342         0.3823
    Total     31         86.875000

  • Statistical Significance: Prob > F is less than 0.0001, so the overall model is significant.

Parameter Estimates

  • Estimation Table:

    Term         Estimate     Std Error    t Ratio    Prob > |t|
    Intercept    10.127938     0.301115      33.63      <0.0001
    Price ($)    -0.895554     0.063767     -14.04      <0.0001

Marketing Decisions Using Regression Model

  • Strategy: Find the price point that maximizes predicted profit.

  • Predicted Profit Formula: $Predicted\,Profit = (Price - MC) \times Predicted\,Demand$

    • If $MC = 0$, then: $Predicted\,Profit = Price \times Predicted\,Demand$

  • Use the estimated regression equation to obtain predicted demand:
    Demand = 10.128 - 0.896 * Price
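
  • Example (Python sketch): a grid search over candidate prices assuming $MC = 0$ and using the estimated demand equation; the price grid itself is an illustrative assumption.

    # Profit maximization sketch: profit = price * predicted demand (MC = 0).
    import numpy as np

    intercept, slope = 10.128, -0.896

    prices = np.linspace(0.01, 11.0, 1000)
    demand = intercept + slope * prices
    profit = prices * demand

    best = np.argmax(profit)
    print(f"profit-maximizing price ~ {prices[best]:.2f}, "
          f"predicted profit ~ {profit[best]:.2f}")
    # Analytically, profit = p*(a + b*p) is maximized at p = -a/(2*b) ~ 5.65.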

Strategies for Missing Data

  1. Drop missing observations:

    • Acceptable if the remaining observations represent the population for the dependent variable.

  2. Replace missing observations with zeros:

    • Overstates the occurrence of zero.

  3. Replace missing observations with averages:

    • More realistic, but may distort the middle of the distribution.
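
  • Example (Python sketch): the three strategies applied to a small hypothetical series with pandas.

    # Missing-data strategies: drop, fill with zero, fill with the mean.
    import numpy as np
    import pandas as pd

    s = pd.Series([4.0, np.nan, 6.0, 5.0, np.nan, 7.0], name="visits")

    dropped = s.dropna()              # 1. drop missing observations
    zero_filled = s.fillna(0)         # 2. overstates the occurrence of zero
    mean_filled = s.fillna(s.mean())  # 3. bunches values at the middle of the distribution

    print(dropped.mean(), zero_filled.mean(), mean_filled.mean())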

Basics of Conjoint Analysis

  • Definition: A commonly used quantitative market research method that quantifies consumer preferences for products/services.

  • History: Widely used since the 1970s, particularly beneficial with the advent of "big data" and online technologies.

Marketing Decisions Addressed by Conjoint Analysis

  1. Product Design:

    • Assess which features or combinations are most valued by customers.

    • Example Question: Should a smartwatch have GPS, music storage, or longer battery life?

  2. Pricing Strategy:

    • Evaluate how much customers are willing to pay (WTP) for specific features.

    • Example Question: What is the optimal price for a premium version of a coffee maker?

  3. Market Segmentation:

    • Identify preference differences across groups.

    • Example Question: Do younger consumers value sustainability more than older ones?

Foundations of Conjoint Analysis

  • Methodology:

    • Define products as collections of attributes and assess reactions to various alternatives.

    • Present product profiles consisting of product attributes.

    • Request individuals to judge products overall.

    • Utilize regression to uncover the underlying preference system.

  • Key Assumptions:

    1. Product represented as a bundle of attributes.

    2. Preference based on a sum of values of individual product attributes.

Conjoint Analysis Process

  1. Identify relevant attributes.

  2. Gather conjoint data from respondents.

  3. Estimate part-worths.

  4. Analyze the output:

    • Attribute importance.

    • Willingness to pay.

    • Market simulation.
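
  • Example (Python sketch): part-worths estimated by dummy-coding attribute levels and regressing one respondent's overall ratings on them; the attributes, levels, and ratings are hypothetical.

    # Part-worth estimation via dummy-coded OLS regression (statsmodels).
    import pandas as pd
    import statsmodels.api as sm

    profiles = pd.DataFrame({
        "gps":     ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
        "battery": ["1day", "1day", "2day", "2day", "1day", "1day", "2day", "2day"],
        "price":   [199, 199, 199, 199, 299, 299, 299, 299],
    })
    ratings = [8, 5, 9, 6, 6, 3, 7, 4]   # respondent's overall judgments of each profile

    X = pd.get_dummies(profiles, columns=["gps", "battery"], drop_first=True)
    X = sm.add_constant(X.astype(float))
    fit = sm.OLS(ratings, X).fit()
    print(fit.params)                     # intercept plus estimated part-worths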

Categorical and Continuous Measures

  • Categorical Measures: A commonly used term for nominal and ordinal measures.

  • Continuous Measures: A commonly used term for interval and ratio measures.

Frequency Analysis

  • Definition: A count of the number of cases in each possible response category.

Outliers

  • Definition: An observation so different in magnitude from the rest of the observations that the analyst treats it as a special case.

Histogram

  • Definition: A column chart illustrating the values of a variable on the x-axis and the corresponding absolute or relative frequency on the y-axis.

Confidence Interval

  • Definition: A projected range within which a population parameter is expected to lie at a given confidence level, derived from a statistic obtained from a probabilistic sample.

Descriptive Statistics

  • Definition: Statistics that summarize the distribution of responses on a variable.

  • Commonly used types: Mean and standard deviation.

Sample Mean $(\bar{x})$

  • Definition: The arithmetic average of responses on a variable.

Sample Standard Deviation $(s)$

  • Definition: A measure of variation of responses on a variable, calculated as the square root of the variance on that variable.

Median Split

  • Technique: Converts a continuous measure into a categorical measure with two approximately equal-sized groups by splitting at the median value.

Cumulative Percentage Breakdown

  • Technique: Converts a continuous measure into a categorical measure based on cumulative percentages obtained from frequency analysis.

Two-Box Technique

  • Technique: Converts an interval-level rating scale into a categorical measure for presentation purposes, reporting the percentage of respondents choosing the top two positions on a rating scale.
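
  • Example (Python sketch): a median split and a top-two-box summary with pandas, on small hypothetical age and rating series.

    # Median split (continuous -> two groups) and two-box percentage.
    import pandas as pd

    age = pd.Series([23, 31, 45, 52, 38, 60, 29, 47])
    age_group = pd.cut(age, bins=[float("-inf"), age.median(), float("inf")],
                       labels=["younger", "older"])
    print(age_group.value_counts())               # two roughly equal-sized groups

    ratings = pd.Series([5, 4, 3, 5, 2, 4, 5, 1]) # 5-point rating scale
    top_two_box = (ratings >= 4).mean() * 100     # share choosing 4 or 5
    print(f"top-two-box: {top_two_box:.0f}%")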

Z-Score and Hypothesis Testing

  • Parameters:

    • $Z$: Z-score corresponding to the confidence level (e.g., 1.96 for 95% confidence).

    • $S$: Sample standard deviation.

    • $N$: Total number of cases.

  • Estimated sampling error for a mean: $Z \times S / \sqrt{N}$; the confidence interval is the sample mean plus or minus this quantity.
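
  • Example (Python sketch): a 95% confidence interval for a mean using $Z$, $S$, and $N$; the sample values are hypothetical.

    # 95% confidence interval for a mean: x_bar +/- Z * S / sqrt(N).
    import math

    x_bar, s, n = 6.19, 0.62, 32   # hypothetical sample mean, std dev, sample size
    z = 1.96                       # Z for a 95% confidence level

    sampling_error = z * s / math.sqrt(n)
    print(f"95% CI: {x_bar - sampling_error:.2f} to {x_bar + sampling_error:.2f}")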

Overview of Hypothesis Testing

  • Purpose: Establish standards for deciding whether sample results represent the overall population.

  • Process: Begin with a hypothesis when preparing for research.

  • Application: Use inferential statistics to determine whether the empirical evidence supports the hypothesis.

Null Hypothesis ($H_0$)

  • Definition: The hypothesis that a proposed result is NOT true for the population.

  • Research aim: Reject $H_0$ in favor of an alternative hypothesis.

Alternative Hypothesis ($H_a$)

  • Definition: The hypothesis that proposes the result IS true for the population.

Rules of Hypothesis Testing

  • A null hypothesis can be rejected but never fully accepted; further evidence may dispute it.

  • Frame $H_0$ so that its rejection suggests tentative acceptance of $H_a$.

Significance Level (Alpha Level, $\alpha$)

  • Definition: The probability of error selected by researchers, usually set at 0.05.

  • Signifies a 5% chance of rejecting a true null hypothesis.

P-Value

  • Definition: The probability of obtaining a result at least as extreme as the one observed if $H_0$ is true.

  • Observation: A result is significant if the p-value is less than the chosen significance level.

Chi-Square Goodness-of-Fit Test

  • Purpose: Determine whether observed frequencies align with expected patterns.

  • Application: Can be applied even when a variable has only two response categories.

Distinction between Analyses

  • Univariate Analyses: Assess individual variables.

  • Multivariate Analyses: Involve multiple variables to analyze relationships.

Frequency Analysis

  • Definition: A count of occurrences within response categories, used in univariate analysis.

Descriptive Statistics for Continuous Measures

  • Common types: Mean (central tendency) and standard deviation (dispersion).

Confidence Intervals for Proportions and Means

  • Definition: The projected range in which the true population proportion or mean is likely to fall, at a designated confidence level (e.g., 95%).

  • Calculation: For means, the sample mean $\pm$ the estimated sampling error; for proportions, the sample proportion $\pm$ the estimated sampling error.

Overview of Hypothesis Testing Purpose

  • Aim: Determine whether findings from a sample can be generalized to the broader population.

Cross Tabulation

  • Definition: A multivariate technique for examining relationships between categorical variables by analyzing their joint distribution.

Independent Samples t-Test for Means

  • Usage: Determine whether two groups differ significantly on a characteristic assessed on a continuous measure.

Paired Sample t-Test

  • Definition: Compares two means computed from scores collected from the same sample.

Regression Analysis

  • Definition: Derives an equation describing the influence of one or more independent variables on a continuous dependent variable.

  • Implementation: The dependent variable is regressed on a set of predictors.

Importance of Multivariate Analysis

  • Provides deeper understanding than univariate analysis by focusing on differences across groups or associations between variables.

Cross Tabulation's Purpose

  • To investigate relationships among categorical variables.

Comparing Groups on a Continuous Dependent Variable

  • The independent samples t-test is applied to compare means between two groups.

Independent Sample vs. Paired Sample t-Test for Means

  • Independent Sample t-Test: Compares means across different respondent groups.

  • Paired Sample t-Test: Compares means of two different variables within the same group.

Examining the Influence of Predictor Variables

  • Use regression analysis to create a mathematical relationship between the dependent variable and one or more predictor variables.

Descriptive Research

  • Definition: Research whose primary purpose is to describe characteristics or relationships in a population, often relying on standardized secondary data collected for concurrent use by multiple entities.

Causal Research

  • Definition: Emphasizes establishing cause-and-effect relationships, requiring three conditions to be fulfilled:

    1. Concomitant variation between cause and effect (they vary together consistently).

    2. Correct time order of cause and effect.

    3. Elimination of alternative explanations.

Experiment Types

  • Laboratory Experiment: Controlled investigation with high internal validity.

  • Field Experiment: Conducted in realistic situations, balancing control and realism.

Validity Considerations

  • Internal Validity: Reflects whether the observed effects can be attributed to the experimental variables.

  • External Validity: Refers to the generalizability of results across various situations.