R Studio, R Basics, Descriptive Analysis, Inference Analysis, Difference Analysis, Associative Analysis, & Regression Analysis Notes

R Studio

  • R Studio is an Integrated Development Environment (IDE) for R, providing a user-friendly interface.
  • R Studio has four main windows:
    • Script Files
    • Environment
    • Console
    • Misc

Script Files

  • Saves your script.
  • Allows code and comments.
  • Can have multiple files open at a time.

Environment

  • Holds your objects.
  • Allows you to review history.

Console/Command Line

  • Can be used as a calculator.
  • Does not save code.
  • Output is displayed here.

Misc

  • Displays files in the working directory.
  • Plots data when produced.
  • Helps with searching for files.

Script Editor

  • Writes and saves R scripts.
  • Great for longer code blocks.

Key Features:

  • Syntax Highlighting: Easily distinguish code elements.
  • Code Completion: Speeds up coding by suggesting code options.
  • Multiple-File Editing: Switch between open scripts effortlessly.
  • Find/Replace: Quickly search and replace text in scripts.

Workspace Environment

  • Displays current R working environment, including user-defined objects.

Codes Used When Managing Objects

  • ls() – Lists all objects in the workspace environment.
  • rm(x) – Removes the object x from the workspace environment.
  • rm(list = ls()) – Removes all objects from the workspace environment.
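
A quick sketch of these commands in the console (x and y here are throwaway example objects):

```r
# Create a few objects in the workspace
x <- 10
y <- c(1, 2, 3)

ls()            # lists all objects currently in the workspace
rm(x)           # removes only x
rm(list = ls()) # clears every remaining object from the workspace
```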

User-Defined Objects:

  • Vectors
  • Matrices
  • Data Frames
  • Lists
  • Functions

Miscellaneous Displays

  • Files: Shows available files in your working directory.
  • Plots: Displays any plots or graphics generated by your code.
  • Packages: Lists all installed packages, including those currently loaded.
  • Help: Search for help topics or view help documentation for commands.

R Basics 2 (tidyverse)

  • Tidyverse is a collection of R packages designed for data science, allowing users to write simple, readable, and efficient codes. Essential for data-wrangling tasks.

Core Tidyverse Packages

  • Loading tidyverse: library(tidyverse) includes:
    • dplyr: Data Manipulation
    • ggplot2: Visualization
    • tidyr: Data tidying
    • readr: Data Import
    • tibble: Enhanced data frames
    • forcats: Categorical variable handling
    • stringr: String manipulation
    • purrr: Functional Programming
    • lubridate: Date/Time Management
  • head(mpg) – shows the first six rows of a dataset.

Pipe Operator

  • Provided by tidyverse, the pipe operator (%>%) forwards a value or the result of an expression into the next expression.
  • Read as “and then”.

Why Use %>%?

  • Improves readability: Code flows from top to bottom, like natural reading; easier to debug and modify.
  • Reduces Complexity: Avoids deeply nested function calls.
  • Increases Maintainability: Each step of the operation is clear and self-contained.
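
A minimal sketch of the same operation written both ways, using the mpg dataset that ships with ggplot2:

```r
library(tidyverse)

# Nested version: must be read inside-out
head(filter(mpg, class == "suv"))

# Piped version: reads top to bottom as "take mpg, AND THEN filter, AND THEN head"
mpg %>%
  filter(class == "suv") %>%
  head()
```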

Transforming Data with dplyr

  • dplyr, part of the tidyverse, is designed for manipulating, sorting, summarizing, and joining data frames.
  • Uses clear and easy-to-read syntax, making data transformation faster and less error-prone.
  • select() – Reduces dataframe size to only desired variables for the current task.
  • mutate() – Creates new variables or new columns in existing data.
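
A sketch of select() and mutate() together on the built-in mpg data (avg_mpg is a made-up column name for illustration):

```r
library(tidyverse)

mpg %>%
  select(manufacturer, model, cty, hwy) %>%  # keep only the variables needed
  mutate(avg_mpg = (cty + hwy) / 2) %>%      # create a new column from existing ones
  head()
```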

Why Data Visualization?

  • Helps understand patterns and trends.
  • Detecting Outliers.
  • Simplifying Complexity.
  • ggplot2 – Package used to construct charts and make plots of the data.
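
A minimal ggplot2 sketch, plotting two variables from the built-in mpg data:

```r
library(tidyverse)

# Scatterplot of engine size against highway fuel economy
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")
```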

Descriptive Analysis

  • Provides a summary of data to create an overall picture (e.g., average customer ratings).

Coding Survey Responses

  • Before statistical analyses, survey responses must be coded as numeric values.

Types of Questions

  • Closed-Ended Questions: Responses are predefined, making data entry and analysis straightforward.
    • Assign Numeric Values: Each response option is given a specific numeric code (e.g., 1= yes, 2 = no).
  • Open-Ended Questions: Responses vary widely, creating a lengthy list of possible answers, and are more complex to code.
    • Qualitative Analysis: Coding requires categorizing or grouping similar responses, which is time-consuming and subjective.

Purpose of Descriptive Analysis

  1. Provides an overview of the data, helping to summarize large datasets.
  2. Sets the foundation for deeper analysis and insight generation.

Two Key Types of Measures

  • Describe the Information obtained in a Sample.
    1. Measures of Central Tendency – Describe the “typical” respondent or response (e.g., mean, median, mode).
    2. Measures of Variability – Describe how similar or different respondents or responses are to the “typical” ones (e.g., range, variance, standard deviation).

Frequency/Percentage Distribution

  • A frequency distribution is a table that shows how often each unique value appears within a data set.
  • A percentage distribution is derived by dividing each frequency by the total number of observations and then converting it to a percentage.
  • Expresses the relative proportion of each value in the data set.
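
A sketch with made-up coded responses (1 = yes, 2 = no), using base R's table() and prop.table():

```r
# Hypothetical coded survey responses
responses <- c(1, 1, 2, 1, 2, 1, 1, 2, 1, 1)

freq <- table(responses)        # frequency distribution: counts per value
pct  <- prop.table(freq) * 100  # percentage distribution: relative proportions

freq  # 1 appears 7 times, 2 appears 3 times
pct   # 1 -> 70%, 2 -> 30%
```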

Range

  • Identifies the distance between the lowest value (minimum) and the highest value (maximum) in an ordered set of values.
  • Range = Maximum – Minimum

Standard Deviation

  • Measures how much values vary around the mean.
    • A low standard deviation means values are close to the mean, while a high standard deviation shows a greater spread.
    • 68% of values fall within one standard deviation of the mean (for approximately normal data, per the empirical rule)
    • 95% fall within two standard deviations of the mean
    • 99.7% fall within three standard deviations of the mean

Inference Analysis

  • A population is the entire group analyzed in a dataset (e.g., all college students in the U.S.).
  • A sample is a smaller, representative group selected from the population (e.g., 1,000 college students).
  • Sample Statistics are values computed from a sample, denoted with Roman letters.
  • Population Parameters are the true values for the entire population, denoted with Greek letters.

Statistical Inference

  • Uses information from a sample to make educated guesses (inferences) about a population.
  • Sampling Error – Is the difference between the sampled results and the true population results.

Types of Statistical Inference

  • Parameter Estimation – Uses sample statistics and confidence intervals to approximate population parameters such as the mean.
  • Hypothesis Testing – Compares sample statistics against hypothesized population values to validate assumptions.
    • Example: If Tesla assumes that less than 20% of consumers are dissatisfied with a product, hypothesis testing can help determine whether the sample data supports or rejects this assumption.

Difference Between Standard Error and Standard Deviation

  • Standard Deviation measures the spread of individual data points within a single dataset.
  • Standard Error measures how precisely a sample statistic (like the mean) estimates a population parameter; it reflects how much the statistic would vary across repeated samples.
  • NOTE: Standard Error decreases as sample size increases.
  • The t.test() function reports the sample mean along with a 95% confidence interval by default.
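
A sketch of t.test() on simulated satisfaction scores (the data here are made up):

```r
set.seed(42)
# Hypothetical sample of 100 customer satisfaction ratings
ratings <- rnorm(100, mean = 7, sd = 1.5)

result <- t.test(ratings)  # 95% confidence interval by default
result$estimate            # sample mean
result$conf.int            # 95% CI for the population mean
```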

Difference Analysis

  • Consumer Heterogeneity – Shows that not all customers are the same; people have different preferences and needs.
  • Market Segmentation – Divides an overall market into groups with similar needs and desires for a particular product or service category.

Common criteria for segmentation for discovering differences:

  • Statistically Significant
  • Meaningful
  • Stable
  • Actionable

Statistically Significant

  • Differences found in the sample truly exist in the population.
    • Example: Cold sufferers grouped by symptoms.
      • Congestion sufferers desire breathing relief.
      • Muscle ache sufferers desire pain relief.
  • NOTE: t.test and ANOVA are statistical tests used to provide evidence as to whether the observed differences are statistically significant.

Meaningful Differences

  • Differences that marketing managers can use to make better decisions.
    • Example: Should a pharmaceutical company include both ingredients in one remedy?
  • NOTE: Just because the differences are statistically significant doesn’t mean they are meaningful.
    • Congestion sufferers: Avoid remedies with drowsy-inducing pain relief ingredients.
    • Pain relief seekers: Avoid remedies causing throat or nasal dryness (side effects).

Stable Differences

  • Persist over time and are not short-term or temporary.
    • Example: Cold sufferers regularly experience the same symptoms.

Actionable Differences

  • Enables marketers to design strategies that effectively target distinct consumer segments.
    • Congestion Sufferers: Focus on breathing relief products.
    • Pain Relief Seekers: Focus on remedies for muscle aches.
  • The prop.test() function calculates a chi-square statistic instead of a z-value. It requires two arguments: the number of successes and the total sample size.

Successes

  • Represent the number of individuals or cases in a sample that exhibit the characteristic being observed.
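
A minimal sketch of prop.test() with made-up counts, echoing the Tesla dissatisfaction example:

```r
# Hypothetical: 120 of 400 sampled consumers are dissatisfied
successes <- 120
n <- 400

result <- prop.test(successes, n)
result$estimate   # sample proportion: 0.3
result$statistic  # chi-square statistic (rather than a z-value)
result$p.value    # significance of the test
```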

Analysis of Variance (ANOVA)

  • Used to compare the means of three or more groups.
  • Examines the differences between group means to determine whether they vary significantly from one another.
  • H_0: There is no difference between the population means.
  • H_a: At least one group mean differs.
  • NOTE: ANOVA uses the F-test which is more efficient and reduces the risk of false positives (Type I errors) when testing multiple groups.
    • If the p-value is less than 0.05 (p<0.05), reject the null hypothesis H_0, indicating at least one group mean differs significantly from the others.
    • If the p-value is greater than or equal to 0.05, fail to reject the null hypothesis, indicating no significant difference among the group means.
  • NOTE: If the null hypothesis is rejected, conduct a post-hoc test to determine which group is significantly different than the others and by how much. Tukey’s HSD is an example of a post-hoc test.
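
A sketch of a one-way ANOVA with a Tukey post-hoc test, using simulated spending data for three hypothetical segments:

```r
set.seed(1)
# Hypothetical spending amounts for three customer segments
spend <- data.frame(
  segment = rep(c("A", "B", "C"), each = 30),
  amount  = c(rnorm(30, 50, 5), rnorm(30, 55, 5), rnorm(30, 50, 5))
)

fit <- aov(amount ~ segment, data = spend)
summary(fit)   # F-test; reject H_0 if p < 0.05

# If H_0 is rejected, check which groups differ and by how much
TukeyHSD(fit)
```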

Associative Analysis

  • Determines whether stable relationships exist between two variables.

Relationship between Two Variables

  • A relationship is a consistent, systematic linkage between the levels or labels of two variables.
    • Levels – Refer to the characteristics used to describe interval or ratio scales. (e.g., older consumers purchase more vitamins).
    • Labels – Refer to the characteristics used to describe nominal or ordinal scales. (e.g., PayPal customers (Yes/No) tend to also be Amazon Prime customers).

Types of Linkage between Variables

  • Statistical Linkage – Indicates a consistent pattern or association between variables, but not causation (e.g., people who exercise daily tend to purchase sports drinks, which suggests a correlation but does not prove that exercising causes sports drink purchases).
  • Causal Linkage – Requires certainty that one variable causes the other (e.g., increased advertising spending leads to higher sales).

Monotonic Linear Relationship

  • A monotonic linear relationship is a straight-line association between two scale variables, in which one variable consistently increases (or decreases) as the other increases.
  • Formula: y = a + bx
    • Example: Burger King estimates that every customer will spend about $12 per lunch.
    • y = 0 + 12x, where x is the number of customers.
      • 100 customers → 12 × 100 = $1,200 in revenue.

Nonmonotonic Relationship

  • The presence or absence of one variable’s label is systematically associated with the presence or absence of another variable’s label.
    • Example: the type of drink a customer orders in the morning (coffee) versus at noon (soft drink).
  • NOTE: Nonmonotonic relationships are often used for the analysis of nominal scale variables.

Characterizing Relationship between Variables

  • Presence: Whether a systematic (statistical) relationship exists between two variables.
  • Pattern: The general nature of the relationship, including its direction.
  • Strength of Association: How consistent and dependable the relationship is.

Step-By-Step procedure for analyzing the relationship between two variables

  • Step 1 – Choose variables to analyze: Identify two variables that might be related.
  • Step 2 – Determine the scaling assumption of the chosen variables: Either nominal, ordinal, interval, or ratio.
  • Step 3 – Use the correct relationship analysis: If the two scales are interval or ratio, use correlation analysis; if the two scales are nominal, use cross-tabulation.
  • Step 4 – Determine if the relationship is present: If the analysis is significant (typically at 95%) it is present.
  • Step 5 – If present, determine the direction of the relationship
  • Step 6 – If present, assess the strength of the relationship: With correlation, the size of the coefficient denotes strength; with cross-tabulation, use Cramer’s V.

Correlation Coefficients

  • Quantifies the strength and direction of a linear relationship between two scale variables, ranging from -1.0 to +1.0.
  • Negative association – x increases, y decreases
  • Positive association – x increases, y increases

Pearson Product Moment Correlation Coefficient

  • Measures the linear relationship between two interval or ratio-scaled variables.
  • The psych package is used for the calculation.
  • The corr.test() function in the psych package computes both the correlation coefficient and the p-value in a single step.
  • The stars component of the result (e.g., results$stars) displays the correlation matrix annotated with significance stars.
  • NOTE: Two or three stars typically indicate strong statistical significance.
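
psych::corr.test() works on whole data frames; for a single pair of variables, base R's cor.test() returns the same Pearson coefficient and p-value, so the sketch below uses it to stay self-contained (the data are simulated):

```r
# Hypothetical interval-scaled variables: ad spend and sales
set.seed(7)
ad_spend <- runif(50, 10, 100)
sales    <- 3 * ad_spend + rnorm(50, sd = 20)

ct <- cor.test(ad_spend, sales)  # Pearson product moment correlation by default
ct$estimate  # correlation coefficient, between -1.0 and +1.0
ct$p.value   # significance of the relationship
```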

Cross Tabulation Analysis

  • A statistical method used to examine nonmonotonic relationships between two nominally scaled variables.
  • A cross-tabulation table is sometimes referred to as an “r × c” table because it consists of r rows and c columns of cells.

Chi-Square Analysis

  • A statistical analysis technique used to examine the frequencies of two nominally scaled variables in a cross-tabulation table.

Degrees of Freedom in Chi-Square Analysis

  • Represent the number of values in a calculation that are free to vary while still meeting the constraints of the data.

Characteristics of the Chi-Square Distribution

  • The Chi-Square Distribution is skewed to the right which means it is not symmetrical.
  • The skewness decreases as the degrees of freedom increases.
  • For large degrees of freedom, the Chi-Square distribution approximates a normal distribution.
  • The rejection region is always located in the right-hand tail of the distribution.
  • Larger Chi-Square values fall in the rejection region, indicating a significant result.
  • tabyl() function within the janitor package is used for cross-tabulation.
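
janitor's tabyl() produces nicely formatted cross-tabulations; the sketch below uses base R's table() and chisq.test() instead so it runs without extra packages (the variables are simulated nominal data):

```r
# Hypothetical nominal variables: payment method and membership status
set.seed(3)
payment <- sample(c("PayPal", "Card"), 200, replace = TRUE)
member  <- sample(c("Yes", "No"), 200, replace = TRUE)

crosstab <- table(payment, member)  # r x c cross-tabulation table
crosstab

result <- chisq.test(crosstab)
result$parameter  # degrees of freedom: (r - 1) * (c - 1) = 1 here
result$p.value    # significant if p < 0.05
```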

Regression Analysis

  • A technique where one or more variables are used to predict the level of another using the straight-line formula.

Bivariate/Simple Regression

  • When two variables are involved.
  • Treats only one dependent variable and one independent variable at a time.

Multiple Regression Analysis

  • Allows simultaneous analysis of multiple independent variables.
  • It is more efficient and comprehensive for complex problems.
  • Multiple Regression equations have two or more independent variables.

Understanding R squared

  • Also referred to as “Coefficient of Determination,” represents the proportion of the variance in the dependent variable that is explained by the independent variable.
  • Explains how well the regression model fits the data.
  • R-squared ranges from 0 to 1:
    • 0: None of the variation is explained by the model.
    • 1: All variation is perfectly explained by the model.
  • For example, an R-squared of 0.75 means 75% of the variability in the dependent variable is explained by the independent variable(s); the remaining 25% is unexplained.
  • NOTE: Higher R-squared values indicate a good fit, while lower values indicate a poor fit.
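
A sketch of a bivariate regression and its R-squared, using the built-in mtcars data (these variable names come from mtcars, not from the notes):

```r
# Predict fuel economy (mpg) from car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)

summary(fit)$r.squared  # proportion of variance in mpg explained by wt (~0.75)
```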

Multicollinearity

  • Occurs when independent variables are moderately or strongly correlated.
  • Can make multiple regression findings unreliable.

Detecting Multicollinearity

  • One common method to detect multicollinearity is the Variance Inflation Factor (VIF).

VIF

  • A measure of how much the variance of a regression coefficient is inflated (increased) due to multicollinearity.
    • VIF less than 10 (<10): No significant multicollinearity.
    • VIF greater than 10 (>10): Problematic multicollinearity; the variable must be reconsidered or removed from the model.
  • lm() function performs the linear regression model.
  • vif() function in the car package is used to check for multicollinearity.
    • vif(full_model) will show VIF values for each variable.
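
A sketch using the built-in mtcars data. car::vif() reports VIFs directly if the car package is installed; to stay self-contained, the block below also computes one VIF by hand from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors:

```r
# Full model with several predictors (mtcars column names, for illustration)
full_model <- lm(mpg ~ wt + disp + hp, data = mtcars)

# VIF for wt, computed by hand: regress wt on the remaining predictors
r2_wt  <- summary(lm(wt ~ disp + hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # well below 10 here, so no problematic multicollinearity for wt
```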

“Trimming” the Regression for Significant Findings

  • Systematically eliminating insignificant independent variables from a regression model, resulting in a simplified model with only statistically significant predictors.
  • Removing variables with p-value higher than 0.05 (p>0.05).

Why Trim?

  • Reduces the risk of overfitting (Too complex).
  • Improves model interpretability.
  • Ensures the regression focuses only on meaningful relationships.

Trimming Process

  1. Run Initial Model (Includes all independent variables).
  2. Identify Insignificant Variables (Check p-value for each variable; insignificant if (p>0.05)).
  3. Eliminate Insignificant Variables (Remove one insignificant variable at a time, starting with the least significant variable).
  4. Re-run the model (re-check for any other insignificant variable).
  5. Stop when all remaining variables have a (p<0.05). (Significant variables).
  • The step() function is used to assist with the trimming process in R.
  • The step() function performs stepwise model selection by iteratively adding or removing predictors to optimize a model’s goodness of fit, based on criteria like AIC or BIC.
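
A sketch of step() on the built-in mtcars data (the predictor names here are mtcars columns, chosen only for illustration):

```r
# Initial model with several predictors
full_model <- lm(mpg ~ wt + disp + hp + drat + qsec, data = mtcars)

# Backward stepwise selection: drop predictors while the AIC improves
trimmed <- step(full_model, direction = "backward", trace = 0)

summary(trimmed)  # predictors remaining after trimming
```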