Lecture Notes on Statistical Analysis and Research Methods
Categorical Variables
Definition: Represent distinct groups or categories.
Examples:
Gender
Types of products
Education level
Measurement Scale: Nominal, Ordinal
Analysis Method: Frequency analysis
Continuous Variables
Definition: Measured on a continuous numerical scale.
Examples:
Height
Weight
Temperature
Income
Measurement Scale: Interval, Ratio
Analysis Method: Descriptive statistics
Univariate Exploratory Data Analysis (EDA)
Components:
Frequency or descriptive statistics
Data visualizations
Goals:
Describe data
Understand the distribution
Identify outliers
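A minimal sketch of the components and goals above, using only Python's standard library. The `education` and `income` values are made-up illustrations, not course data, and the 1.5 × IQR outlier rule is one common convention among several.

```python
from collections import Counter
import statistics

# Hypothetical survey responses (illustrative only).
education = ["HS", "BA", "BA", "MA", "HS", "BA", "PhD", "BA"]
income = [42, 48, 51, 39, 45, 47, 50, 120]  # in $1,000s

# Frequency analysis for a categorical variable.
freq = Counter(education)
rel_freq = {level: count / len(education) for level, count in freq.items()}

# Descriptive statistics for a continuous variable.
mean_income = statistics.mean(income)
sd_income = statistics.stdev(income)  # sample standard deviation
median_income = statistics.median(income)

# Flag potential outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(income, n=4)
iqr = q3 - q1
outliers = [x for x in income if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(freq, mean_income, median_income, outliers)
```

The single large income pulls the mean well above the median, which is the kind of pattern univariate EDA is meant to surface.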
Bivariate Analysis
Definition: Examines the relationship between two variables
Methods for Two Categorical Variables:
Summarization Method: Cross-tabulation
Hypothesis Testing Method: Chi-square test of independence
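The two methods above can be sketched as follows. The membership and pool-usage counts are hypothetical, and the decision uses the tabulated critical value 3.841 (df = 1, alpha = 0.05) instead of an exact p-value so that the example needs no external library.

```python
from collections import Counter

# Hypothetical paired observations of two categorical variables:
# (membership status, uses therapy pool).
pairs = (
    [("member", "yes")] * 30 + [("member", "no")] * 20
    + [("non-member", "yes")] * 20 + [("non-member", "no")] * 30
)

# Cross-tabulation: joint counts plus row and column totals.
joint = Counter(pairs)
row_totals = Counter(r for r, _ in pairs)
col_totals = Counter(c for _, c in pairs)
n = len(pairs)

# Chi-square statistic: sum of (observed - expected)^2 / expected, where
# expected = row total * column total / n if the variables are independent.
chi_square = 0.0
for r in row_totals:
    for c in col_totals:
        expected = row_totals[r] * col_totals[c] / n
        chi_square += (joint[(r, c)] - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)
CRITICAL_05_DF1 = 3.841  # chi-square critical value for df = 1, alpha = 0.05
reject_independence = chi_square > CRITICAL_05_DF1

print(chi_square, df, reject_independence)
```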
Bivariate Analysis (One Continuous & One Categorical)
Summarization Method: Comparing means between groups
Hypothesis Testing Methods:
Independent samples t-test
ANOVA
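As an illustration of comparing means between two groups, the sketch below computes a pooled-variance independent samples t statistic by hand. The visit counts are hypothetical, and equal population variances are assumed. With more than two groups, ANOVA plays the same role.

```python
import math
import statistics

# Hypothetical visit counts for two independent groups.
pool_users = [5, 7, 6, 8, 9]
non_users = [3, 4, 5, 4, 4]

n1, n2 = len(pool_users), len(non_users)
mean1, mean2 = statistics.mean(pool_users), statistics.mean(non_users)
var1, var2 = statistics.variance(pool_users), statistics.variance(non_users)

# Pooled variance assumes both groups share the same population variance.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean1 - mean2) / std_error
df = n1 + n2 - 2
CRITICAL_T_05_DF8 = 2.306  # two-tailed critical t for df = 8, alpha = 0.05
significant = abs(t_stat) > CRITICAL_T_05_DF8

print(round(t_stat, 3), df, significant)
```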
Correlation
Definition: Correlation $(\rho)$ measures the strength of the linear relationship between two continuous variables.
Range of Values: Always between -1 and 1
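A minimal sketch of the sample (Pearson) correlation coefficient, written out by hand to show why the value stays between -1 and 1. The function name `pearson_r` and the data are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Co-variation of x and y, scaled by the variation of each variable.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: visit frequency and a satisfaction score.
visits = [1, 2, 3, 4, 5]
satisfaction = [2, 1, 4, 3, 5]
r = pearson_r(visits, satisfaction)

print(r)
```

A perfectly increasing linear relationship gives 1, and a perfectly decreasing one gives -1.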
Positive Correlation
If correlation is greater than 0: the two variables tend to move in the same direction (as one increases, so does the other).
Statistical Analysis Methods:
Therapy pool usage and visit frequency: Independent samples t-test
Visit frequency and membership fees: Correlation coefficient
Median age and therapy pool usage: Chi-square test
Correlation vs. Regression
Correlation: Measures strength of linear relationship between two continuous variables.
Correlation coefficient ranges from -1 to 1
Regression: Aims to establish a linear equation between variables.
Involves estimating both intercept and slope parameters.
Independent variables do not need to be continuous.
A linear equation can predict the outcome variable; correlation cannot be used for prediction.
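To make the contrast concrete, here is a small least-squares sketch that estimates an intercept and a slope and then uses the fitted equation for prediction. The helper `fit_line` and the price and demand values are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical price and demand observations.
price = [2, 4, 6, 8]
demand = [9, 7, 4, 2]

intercept, slope = fit_line(price, demand)
predicted_at_5 = intercept + slope * 5  # prediction for a price of 5

print(intercept, slope, predicted_at_5)
```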
Regression Analysis
Uses:
To assess if there's a statistically significant relationship between variables.
To measure the effect of one variable on another.
To predict the outcome variable.
To control for other variables when making assessments.
Applications:
Understanding how predictor variables influence dependent variables.
Predicting/forecasting dependent variables based on predictor variables.
Examples in Regression Analysis
Price and demand data
AFC data
Steps in Regression Analysis
Plot scatter diagram.
Formulate general model.
Estimate parameters.
Estimate regression coefficients.
Test for significance.
Determine strength and significance of association.
Examine residuals.
Evaluating Regression Significance
F-Test for Model Validity
Questions:
Is the overall regression model significant?
What is the probability that at least one of the estimated coefficients differs from zero?
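Null Hypothesis ($H_0$): $\beta_{price} = 0$ (all slope coefficients are zero; the intercept is not tested by the overall F-test).
Alternative Hypothesis ($H_a$): $\beta_{price} \neq 0$ (with several predictors: at least one slope coefficient is not zero).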
Test Statistic Formula: $F = \frac{MS_{regression}}{MS_{residual}}$, where:
$MS_{regression}$ = regression sum of squares (SSR) / degrees of freedom for regression (dfr).
$MS_{residual}$ = residual (error) sum of squares (SSE) / degrees of freedom residual (dfres).
Decision Rule: Reject $H_0$ if $p < 0.05$.
t-Test for Price Coefficient Validity
Questions:
Does a specific predictor have a significant effect on the dependent variable?
Example of Demand Relationship: $Demand_i = a + b_1 \cdot Price_i + e_i$, where $i$ refers to the observation.
Two-tailed Test:
$H_0$: $\beta_{price} = 0$
$H_a$: $\beta_{price} \neq 0$
One-tailed Test:
$H_0$: $\beta_{price} = 0$
$H_a$: $\beta_{price} < 0$
Test Statistic Formula: $t = \frac{\hat{\beta}_{price}}{SE_{price}}$, where $\hat{\beta}_{price}$ is the estimated price coefficient and $SE_{price}$ is its standard error.
Decision Rule: Reject $H_0$ if $p < 0.05$.
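As a quick check of the t formula, the sketch below divides each estimate by its standard error, using the values reported in the Parameter Estimates table later in these notes. The dictionary layout is only an illustrative choice.

```python
# (estimate, standard error) pairs copied from the Parameter Estimates table.
estimates = {
    "Intercept": (10.127938, 0.301115),
    "Price": (-0.895554, 0.063767),
}

# t ratio = estimate / standard error, testing H0: coefficient = 0.
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

for term, t in t_ratios.items():
    print(f"{term}: t = {t:.2f}")
```

With a single predictor, the squared t ratio for price should match the F ratio of the overall model up to rounding in the reported figures.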
Predicting Demand and Profit with Linear Fit
Equation:
Demand = 10.127938 - 0.8955541 * Price
Summary of Fit:
R-Squared: $0.867979$
Adjusted R-Squared: $0.863578$
Root Mean Square Error: $0.618313$
Mean of Response: $6.1875$
Observations: $32$
Lack of Fit Analysis of Variance
Analysis of Variance Breakdown:
Source    DF    Sum of Squares    Mean Square    F Ratio
Model     1     75.405658         75.4057        197.2362
Error     30    11.469342         -              -
Total     31    86.875000         -              -
Statistical Significance: Prob > F is less than 0.0001, so the model is significant at the 0.05 level.
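The fit statistics reported above can be re-derived from the sums of squares in the analysis-of-variance table. The sketch below does this arithmetic; the variable names are illustrative, and the inputs are the table values.

```python
import math

# Values copied from the analysis-of-variance table.
ss_model, df_model = 75.405658, 1
ss_error, df_error = 11.469342, 30
ss_total = ss_model + ss_error

# Mean squares and the F ratio.
ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_ratio = ms_model / ms_error

# Summary-of-fit statistics.
n_obs = df_model + df_error + 1  # 32 observations
r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_error
rmse = math.sqrt(ms_error)

print(f"F = {f_ratio:.4f}, R^2 = {r_squared:.6f}, "
      f"adj R^2 = {adj_r_squared:.6f}, RMSE = {rmse:.6f}")
```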
Parameter Estimates
Estimation Table:
Term         Estimate     Std Error    t Ratio    Prob > |t|
Intercept    10.127938    0.301115     33.63      <0.0001
Price ($)    -0.895554    0.063767     -14.04     <0.0001
Marketing Decisions Using Regression Model
Strategy: Find price point that maximizes predicted profit.
Predicted Profit Formula: $Predicted\ Profit = (Price - MC) \times Predicted\ Demand$, where $MC$ is the marginal cost per unit.
If $MC = 0$, then:
$Predicted\ Profit = Price \times Predicted\ Demand$
Use estimated regression equation to obtain predicted demand:
Demand = 10.128 - 0.896 * Price
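A sketch of the profit-maximizing price under the fitted demand equation. It assumes the linear fit holds over the whole price range considered, and the function names are hypothetical. Setting the derivative of $(Price - MC)(a - b \cdot Price)$ to zero gives $Price^* = (a + b \cdot MC) / (2b)$; a grid search is included as a cross-check.

```python
A, B = 10.128, 0.896  # intercept and absolute slope of the fitted demand line

def predicted_demand(price):
    return A - B * price

def predicted_profit(price, mc=0.0):
    return (price - mc) * predicted_demand(price)

def optimal_price(mc=0.0):
    # First-order condition of (price - mc) * (A - B * price).
    return (A + B * mc) / (2 * B)

best_price = optimal_price()
best_profit = predicted_profit(best_price)

# Cross-check: search candidate prices in 1-cent steps while demand is positive.
grid = [cents / 100 for cents in range(0, 1131)]
grid_best = max(grid, key=predicted_profit)

print(round(best_price, 4), round(best_profit, 4), grid_best)
```

With a positive marginal cost, `optimal_price(mc)` shifts the price upward by half of the marginal cost.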
Strategies for Missing Data
Drop missing observations:
Acceptable if remaining observations represent the population for the dependent variable.
Replace missing observations with zeros:
Overstates occurrence of zero.
Replace missing observations with averages:
More realistic, but may distort middle of distribution.
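The effect of each strategy on the mean and standard deviation can be seen in a small sketch. The responses are hypothetical, with `None` marking a missing observation.

```python
import statistics

# Hypothetical responses; None marks a missing observation.
responses = [4, None, 6, 8, None, 10]
observed = [x for x in responses if x is not None]

# Strategy 1: drop missing observations.
dropped = observed
# Strategy 2: replace missing observations with zeros.
zero_filled = [0 if x is None else x for x in responses]
# Strategy 3: replace missing observations with the observed average.
mean_value = statistics.mean(observed)
mean_filled = [mean_value if x is None else x for x in responses]

summary = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in [
        ("drop", dropped), ("zeros", zero_filled), ("mean", mean_filled)
    ]
}

for name, (m, s) in summary.items():
    print(f"{name}: mean = {m:.3f}, sd = {s:.3f}")
```

Zero-filling drags the mean down, while mean-filling keeps the mean but shrinks the standard deviation because extra cases pile up in the middle of the distribution.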
Basics of Conjoint Analysis
Definition: A commonly used quantitative market research method that quantifies consumer preferences for products/services.
History: Widely used since the 1970s, particularly beneficial with the advent of "big data" and online technologies.
Marketing Decisions Addressed by Conjoint Analysis
Product Design:
Assess which features or combinations are most valued by customers.
Example Question: Should a smartwatch have GPS, music storage, or longer battery life?
Pricing Strategy:
Evaluate how much customers are willing to pay for specific features (WTP).
Example Question: What is the optimal price for a premium version of a coffee maker?
Market Segmentation:
Identify preference differences across groups.
Example Question: Do younger consumers value sustainability more than older ones?
Foundations of Conjoint Analysis
Methodology:
Define products as collections of attributes and assess reactions to various alternatives.
Present product profiles consisting of product attributes.
Request individuals to judge products overall.
Utilize regression to uncover the underlying preference system.
Key Assumptions:
Product represented as a bundle of attributes.
Preference based on a sum of values of individual product attributes.
Conjoint Analysis Process
Identify relevant attributes.
Gather conjoint data from respondents.
Estimate part-worths.
Analyze the output:
Attribute importance.
Willingness to pay.
Market simulation.
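A toy sketch of the estimation and output steps, assuming one respondent rates four hypothetical smartwatch profiles in a balanced full-factorial design. In such a design, the least-squares part-worth of a level relative to its baseline equals the difference in mean ratings, so no regression library is needed here. All ratings, attribute levels, and names are made up.

```python
# Hypothetical ratings (1-10) from one respondent for four profiles.
profiles = [
    {"gps": "no",  "price": 200, "rating": 6},
    {"gps": "yes", "price": 200, "rating": 8},
    {"gps": "no",  "price": 300, "rating": 3},
    {"gps": "yes", "price": 300, "rating": 5},
]

def mean_rating(attribute, level):
    ratings = [p["rating"] for p in profiles if p[attribute] == level]
    return sum(ratings) / len(ratings)

# Part-worths relative to the baseline levels (no GPS, $200).
partworth_gps = mean_rating("gps", "yes") - mean_rating("gps", "no")
partworth_price_300 = mean_rating("price", 300) - mean_rating("price", 200)

# Attribute importance: each attribute's part-worth range as a share of the total.
ranges = {"gps": abs(partworth_gps), "price": abs(partworth_price_300)}
importance = {k: v / sum(ranges.values()) for k, v in ranges.items()}

# Willingness to pay: utility gained from GPS divided by utility lost per dollar.
utility_per_dollar = abs(partworth_price_300) / (300 - 200)
wtp_gps = partworth_gps / utility_per_dollar

print(partworth_gps, partworth_price_300, importance, round(wtp_gps, 2))
```

Real studies use more attributes, more respondents, and usually fractional designs, in which case the part-worths come from a fitted regression, not from simple mean differences.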
Categorical and Continuous Measures
Categorical Measures: Frequently used expression for nominal and ordinal measures.
Continuous Measures: Frequently used expression for interval and ratio measures.
Frequency Analysis
Definition: A count of the number of cases in each possible response category.
Outliers
Definition: An observation significantly different in magnitude from the rest, treated specially by analysts.
Histogram
Definition: A column chart illustrating values of a variable on the x-axis and the corresponding absolute or relative frequency on the y-axis.
Confidence Interval
Definition: A projected range within which a population parameter is expected to lie at a given confidence level, derived from a statistic obtained from a probabilistic sample.
Descriptive Statistics
Definition: Statistics that summarize the distribution of responses on a variable.
Commonly used types: Mean and standard deviation.
Sample Mean $(\bar{x})$
Definition: The arithmetic average of responses on a variable.
Sample Standard Deviation $(s)$
Definition: A measure of variation of responses on a variable. Calculated as the square root of the variance on a variable.
Median Split
Technique: Converts a continuous measure into a categorical measure with two approximately equal-sized groups by splitting at the median value.
Cumulative Percentage Breakdown
Technique: Converts a continuous measure into a categorical measure based on cumulative percentages obtained from frequency analysis.
Two-Box Technique
Technique: Converts an interval-level rating scale into a categorical measure for presentation purposes, reporting the percentage of respondents choosing the top two positions on a rating scale.
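Two of the conversion techniques above, the median split and the two-box technique, can be sketched in a few lines. The ages and the 5-point ratings are hypothetical.

```python
import statistics

ages = [22, 35, 41, 28, 53, 47, 31, 60]   # hypothetical continuous measure
ratings = [5, 4, 3, 5, 2, 4, 1, 5, 4, 3]  # hypothetical 5-point rating scale

# Median split: two groups of approximately equal size.
median_age = statistics.median(ages)
age_group = ["older" if a > median_age else "younger" for a in ages]

# Two-box technique: share of respondents choosing the top two positions.
top_two_box = sum(r >= 4 for r in ratings) / len(ratings)

print(median_age, age_group.count("older"), age_group.count("younger"), top_two_box)
```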
Z-Score and Hypothesis Testing
Parameters:
$Z$: Z-score corresponding to confidence level (e.g., 1.96 for 95% confidence).
$S$: Sample standard deviation.
$N$: Total number of cases.
Overview of Hypothesis Testing
Purpose: Establish standards for deciding if sample results represent the overall population.
Process: Begin with a hypothesis when preparing for research.
Application: Use inferential statistics to determine if empirical evidence confirms the hypothesis.
Null Hypothesis ($H_0$)
Definition: The hypothesis that a proposed result is NOT true for the population.
Research aim: Reject $H_0$ in favor of an alternative hypothesis.
Alternative Hypothesis ($H_a$)
Definition: The hypothesis that proposes the result is TRUE for the population.
Rules of Hypothesis Testing
A null hypothesis can be rejected but not wholly accepted; further evidence may dispute it.
Frame $H_0$ so its rejection suggests tentative acceptance of $H_a$.
Significance Level (Alpha Level) ($\alpha$)
Definition: The probability of error selected by researchers, usually set at 0.05.
Signifies a 5% chance of rejecting a true null hypothesis.
P-Value
Definition: Probability of obtaining a result at least as extreme as the one observed if $H_0$ is true.
Observations: A result is significant if the p-value is less than the chosen significance level.
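A minimal sketch of the decision rule, assuming a test statistic that follows a standard normal distribution under $H_0$. The observed value `z_observed` is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05      # chosen significance level
z_observed = 2.1  # hypothetical test statistic

p_value = two_sided_p_value(z_observed)
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```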
Chi-Square Goodness-of-Fit Test
Purpose: Determine if observed frequencies align with expected patterns.
Application: Can be applied even when the variable has only two levels.
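A sketch of the goodness-of-fit statistic for one variable with two levels, comparing hypothetical observed counts against an assumed 50/50 expected pattern. The decision uses the tabulated critical value 3.841 (df = 1, alpha = 0.05).

```python
# Hypothetical observed counts for a two-level variable.
observed = {"yes": 58, "no": 42}
expected_share = {"yes": 0.5, "no": 0.5}  # assumed expected pattern

n = sum(observed.values())
chi_square = sum(
    (observed[k] - n * expected_share[k]) ** 2 / (n * expected_share[k])
    for k in observed
)

df = len(observed) - 1
CRITICAL_05_DF1 = 3.841  # chi-square critical value for df = 1, alpha = 0.05
reject_fit = chi_square > CRITICAL_05_DF1

print(chi_square, df, reject_fit)
```

Here the statistic stays below the critical value, so the observed frequencies are consistent with the expected pattern at the 0.05 level.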
Distinction between Analyses
Univariate Analyses: Assess individual variables.
Multivariate Analyses: Involve multiple variables to analyze relationships.
Frequency Analysis
Definition: Count occurrences within response categories, used in univariate analysis.
Descriptive Statistics for Continuous Measures
Common types: Mean (central tendency) and standard deviation (dispersion).
Confidence Intervals for Proportions and Means
Definition: Projected range where true population proportions or means likely fall, at a designated confidence level (e.g., 95%).
Calculation: For means, mean $\pm$ estimated sampling error; for proportions, proportion $\pm$ estimated sampling error.
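A sketch of both calculations, assuming the usual large-sample formulas: sampling error $Z \cdot S / \sqrt{N}$ for a mean and $Z \cdot \sqrt{p(1-p)/N}$ for a proportion. The function names and data are illustrative, and the z-based interval is only an approximation for small samples.

```python
import math
import statistics

Z_95 = 1.96  # z-score corresponding to a 95% confidence level

def mean_ci(values, z=Z_95):
    """Confidence interval for a mean: mean +/- z * s / sqrt(n)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    margin = z * s / math.sqrt(len(values))
    return m - margin, m + margin

def proportion_ci(successes, n, z=Z_95):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical responses
mean_low, mean_high = mean_ci(scores)
prop_low, prop_high = proportion_ci(40, 100)  # 40 of 100 respondents

print(round(mean_low, 3), round(mean_high, 3))
print(round(prop_low, 3), round(prop_high, 3))
```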
Overview of Hypothesis Testing Purpose
Aim: Determine if findings from a sample can be generalized over the broader population.
Cross Tabulation
Definition: A multivariate technique examining relationships between multiple categorical variables, analyzing joint distributions.
Independent Samples t-Test for Means
Usage: Determine if two groups significantly differ in a characteristic assessed on a continuous measure.
Paired Sample t-Test
Definition: Compares two means from scores collected from the same sample.
Regression Analysis
Definition: Derives an equation detailing the influence of one or more independent variables on a continuous dependent variable.
Implementation: The dependent variable is regressed against a set of predictors.
Importance of Multivariate Analysis
Provides deeper understanding compared to univariate analysis, focusing on differences across groups or associations between variables.
Cross Tabulation's Purpose
To investigate relationships among categorical variables.
Comparing Groups on Continuous Dependent Variable
The independent samples t-test is applied to compare means between two groups.
Independent Sample vs. Paired Sample t-Test for Means
Independent Sample t-Test: Compares means across different respondent groups.
Paired Sample t-Test: Compares means of two different variables within the same group.
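To illustrate the paired case, the sketch below works on the within-respondent differences between two measures. The scores are hypothetical, and the critical value 2.776 is the tabulated two-tailed t for df = 4 at alpha = 0.05.

```python
import math
import statistics

# Hypothetical scores from the same five respondents on two measures.
before = [6, 7, 5, 8, 6]
after = [8, 9, 6, 9, 8]

# A paired test analyzes the per-respondent differences.
diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))

t_paired = mean_diff / se_diff
df = len(diffs) - 1
CRITICAL_T_05_DF4 = 2.776  # two-tailed critical t for df = 4, alpha = 0.05
significant = abs(t_paired) > CRITICAL_T_05_DF4

print(round(t_paired, 3), df, significant)
```

An independent samples test would instead treat the two columns as separate groups and ignore the pairing.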
Examining Influence of Predictor Variables
Utilize regression analysis to create a mathematical relationship between dependent and one or more predictor variables.
Descriptive Research
Definition: Research that primarily focuses on describing the characteristics of a population or the relationships between variables.
Causal Research
Definition: Emphasizes establishing cause-and-effect relationships, requiring three conditions to be fulfilled:
Concomitant variation: the cause and the effect vary together consistently.
Correct time order of cause and effect.
Elimination of alternative explanations.
Experiment Types
Laboratory Experiment: Controlled investigation with high internal validity.
Field Experiment: Conducted in realistic situations, balancing control and realism.
Validity Considerations
Internal Validity: The extent to which the observed effect can be attributed to the experimental variable and not to other factors.
External Validity: Refers to the generalizability of results across various situations.