R Studio, R Basics, Descriptive Analysis, Inference Analysis, Difference Analysis, Associative Analysis, & Regression Analysis Notes
R Studio
- R Studio is an Integrated Development Environment (IDE) for R, providing a user-friendly interface.
- R Studio has four main windows:
- Script Files
- Environment
- Console
- Misc
Script Files
- Saves your script.
- Allows code and comments.
- Can have multiple files open at a time.
Environment
- Holds your objects.
- Allows you to review history.
Console/Command Line
- Can be used as a calculator.
- Does not save codes.
- Output is displayed here.
Misc
- Displays files in the working directory.
- Plots data when produced.
- Helps with searching for files.
Script Editor
- Writes and saves R scripts.
- Great for longer code blocks.
Key Features:
- Syntax Highlighting: Easily distinguish code elements.
- Code Completion: Speeds up coding by suggesting code options.
- Multiple-File Editing: Switch between open scripts effortlessly.
- Find/Replace: Quickly search and replace text in scripts.
Workspace Environment
- Displays current R working environment, including user-defined objects.
Codes Used When Managing Objects
ls() – Lists all objects in the workspace environment.
rm(x) – Removes a specific object from the workspace environment.
rm(list = ls()) – Removes all objects from the workspace environment.
User-Defined Objects:
- Vectors
- Matrices
- Data Frames
- Lists
- Functions
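A minimal sketch of creating one of each object type and then managing them with the commands above (all names are illustrative):

```r
# One example of each user-defined object type
v <- c(1, 2, 3)                        # vector
m <- matrix(1:6, nrow = 2)             # matrix with 2 rows, 3 columns
df <- data.frame(id = 1:3, score = v)  # data frame
lst <- list(nums = v, tab = df)        # list (can hold mixed types)
square <- function(x) x^2              # function

ls()        # lists the objects created above
rm(m)       # removes just m from the environment
# rm(list = ls())  # would clear the whole environment
```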
Miscellaneous Displays
- Files: Shows available files in your working directory.
- Plots: Displays any plots or graphics generated by your code.
- Packages: Lists all downloaded packages, including those currently loaded.
- Help: Search for help topics or view help documentation for commands.
R Basics 2 (tidyverse)
- Tidyverse is a collection of R packages designed for data science, allowing users to write simple, readable, and efficient codes. Essential for data-wrangling tasks.
Core Tidyverse Packages
- Loading tidyverse:
library(tidyverse) includes:
- dplyr: Data Manipulation
- ggplot2: Visualization
- tidyr: Data tidying
- readr: Data Import
- tibble: Enhanced data frames
- forcats: Categorical variable handling
- stringr: String manipulation
- purrr: Functional Programming
- lubridate: Date/Time Management
head(mpg) – shows the first six rows of a dataset.
Pipe Operator
- Provided by tidyverse, the pipe operator (%>%) forwards a value or the result of an expression into the next expression.
- Read as “and then”.
Why Use %>%?
- Improves readability: Code flows from top to bottom, like natural reading; easier to debug and modify.
- Reduces Complexity: Avoids deeply nested function calls.
- Increases Maintainability: Each step of the operation is clear and self-contained.
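The readability difference is easiest to see side by side. A small sketch (assumes the magrittr/tidyverse pipe is available):

```r
library(magrittr)  # %>% is provided by magrittr, which the tidyverse loads

x <- c(4, 9, 16)

# Nested calls: read inside-out
res1 <- round(sqrt(mean(x)), 1)

# Piped: read top-to-bottom, "and then"
res2 <- x %>% mean() %>% sqrt() %>% round(1)

identical(res1, res2)  # TRUE — same result, clearer flow
```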
- dplyr, part of the tidyverse package, is designed for manipulating, sorting, summarizing, and joining data frames.
- Uses clear and easy-to-read syntax, making data transformation faster and less error-prone.
select() – Reduces a data frame to only the variables needed for the current task.
mutate() – Creates new variables or columns from existing data.
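A short sketch of both verbs on the built-in mtcars data (assumes dplyr is installed; the column names are from mtcars):

```r
library(dplyr)  # part of the tidyverse

small <- mtcars %>%
  select(mpg, wt) %>%           # keep only the variables needed
  mutate(wt_kg = wt * 453.6)    # new column: wt is in 1000 lb, so convert to kg

head(small)
```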
Why Data Visualization?
- Helps understand patterns and trends.
- Detecting Outliers.
- Simplifying Complexity.
ggplot2 – Package used to construct charts and make plots of the data.
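A minimal ggplot2 sketch using its built-in mpg dataset (variable names displ and hwy come from that dataset):

```r
library(ggplot2)

# Scatter plot: engine displacement vs. highway mileage
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")

print(p)  # renders the plot in the Plots pane
```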
Descriptive Analysis
- Provides a summary of data to create an overall picture (e.g., average customer ratings).
Coding Survey Responses
- Before statistical analyses, survey responses must be coded as numeric values.
Types of Questions
- Closed-Ended Questions: Responses are predefined, making data entry and analysis straightforward.
- Assign Numeric Values: Each response option is given a specific numeric code (e.g., 1= yes, 2 = no).
- Open-Ended Questions: Responses vary widely, creating a lengthy list of possible answers and are more complex to code.
- Qualitative Analysis: Coding requires categorizing or grouping similar responses, which is time-consuming and subjective.
Purpose of Descriptive Analysis
- Provides an overview of the data, helping to summarize large datasets.
- Sets the foundation for deeper analysis and insight generation.
Two Key Types of Measures
- Describe the Information obtained in a Sample.
- Measures of Central Tendency – Describe the “typical” respondent or response (e.g., mean, median, mode).
- Measures of Variability – Describe how similar or different respondents or responses are to the “typical” ones (e.g., range, variance, standard deviation).
Frequency/Percentage Distribution
- A frequency distribution is a table that shows how often each unique value appears within a data set.
- A percentage distribution is derived by dividing each frequency by the total number of observations and then converting it to a percentage.
- Expresses the relative proportion of each value in the data set.
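Both distributions can be produced in base R with table() and prop.table(). A small sketch with made-up responses:

```r
responses <- c("Yes", "No", "Yes", "Yes", "No")

freq <- table(responses)        # frequency distribution: how often each value appears
pct  <- prop.table(freq) * 100  # percentage distribution: frequency / total * 100

freq  # No: 2, Yes: 3
pct   # No: 40, Yes: 60
```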
Range
- Identifies the distance between the lowest value (minimum) and the highest value (maximum) in an ordered set of values.
- Range = Maximum – Minimum
Standard Deviation
- Measures how much values vary around the mean.
- A low standard deviation means values are close to the mean, while a high standard deviation shows a greater spread.
- 68% - values fall within one standard deviation of the mean
- 95% - within two standard deviations of the mean
- 99.7% - within three standard deviations of the mean
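Range, mean, and standard deviation on a small made-up vector, including a quick check of how many values fall within one standard deviation of the mean:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

diff(range(x))  # range = max - min = 9 - 2 = 7
mean(x)         # 5
sd(x)           # sample standard deviation, about 2.14

# Share of values within one standard deviation of the mean
mean(abs(x - mean(x)) <= sd(x))  # 0.75 for this small sample
```

With only eight values the empirical-rule percentages are rough; they hold closely for large, roughly normal samples.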
Inference Analysis
- A population is the entire group analyzed in a dataset (e.g., all college students in the U.S.).
- A sample is a smaller, representative group selected from the population (e.g., 1,000 college students).
- Sample Statistics – Values computed from a sample, denoted with Roman letters.
- Population Parameters – True values for the entire population, denoted with Greek letters.
Statistical Inference
- Uses information from a sample to make educated guesses (inferences) about a population.
- Sampling Error – Is the difference between the sampled results and the true population results.
Types of Statistical Inference
- Parameter Estimation – Uses sample statistics and confidence intervals to approximate population parameters such as mean.
- Hypothesis Testing – Compares sample statistics against hypothesized population values to validate assumptions.
- Example: If Tesla assumes that less than 20% of consumers are dissatisfied with a product, hypothesis testing can help determine whether the sample data supports or rejects this assumption.
Difference Between Standard Error and Standard Deviation
- Standard Deviation measures the spread of individual data points within a single dataset.
- Standard Error measures how precisely a sample statistic (like the mean) estimates the corresponding population parameter.
- NOTE: Standard Error decreases as sample size increases.
The t.test() function calculates both the sample mean and a 95% confidence interval.
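A sketch on simulated ratings, showing the standard error shrinking with sample size and the t.test() output:

```r
set.seed(1)
scores <- rnorm(100, mean = 50, sd = 10)  # simulated ratings (illustrative data)

sd(scores)              # spread of individual values
sd(scores) / sqrt(100)  # standard error of the mean; shrinks as n grows

t.test(scores)          # prints the sample mean and a 95% confidence interval
```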
Difference Analysis
- Consumer Heterogeneity – Shows that not all customers are the same; people have different preferences and needs.
- Market Segmentation – Divides an overall market into groups with similar needs and desires for a particular product or service category.
Common criteria for segmentation for discovering differences:
- Statistically Significant
- Meaningful
- Stable
- Actionable
Statistically Significant
- Differences found in the sample truly exist in the population.
- Example: Cold sufferers grouped by symptoms.
- Congestion sufferers desire breathing relief.
- Muscle ache sufferers desire pain relief.
- NOTE: t.test and ANOVA are statistical tests used to provide evidence as to whether the observed differences are statistically significant.
Meaningful Differences
- Differences that marketing managers can use to make better decisions.
- Example: Should a pharmaceutical company include both ingredients in one remedy?
- NOTE: Just because the differences are statistically significant doesn’t mean they are meaningful.
- Congestion sufferers: Avoid remedies with drowsy-inducing pain relief ingredients.
- Pain relief seekers: Avoid remedies causing throat or nasal dryness (side effects).
Stable Differences
- Persist over time and are not short-term or temporary.
- Example: “Cold sufferers suffer regularly same symptoms”.
Actionable Differences
- Enables marketers to design strategies that effectively target distinct consumer segments.
- Congestion Sufferers: Focus on breathing relief products.
- Pain Relief Seekers: Focus on remedies for muscle aches.
The prop.test() function calculates a chi-square statistic instead of a z-value. It requires two arguments: the numbers of successes and the total sample sizes.
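A sketch comparing two proportions with made-up counts (45 of 200 dissatisfied in one segment, 30 of 180 in another):

```r
# x = numbers of successes, n = total sample sizes
res <- prop.test(x = c(45, 30), n = c(200, 180))

res$statistic  # chi-square value
res$p.value    # significant difference between segments if below 0.05
```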
Successes
- Represent the number of individuals or cases in a sample based on the variable being observed.
Analysis of Variance (ANOVA)
- Used to compare the means of three or more groups.
- Examines the differences between group means to determine whether they vary significantly from one another.
- H_0: There is no difference between the population means.
- H_a: At least one group mean differs.
- NOTE: ANOVA uses the F-test which is more efficient and reduces the risk of false positives (Type I errors) when testing multiple groups.
- If the p-value is less than 0.05 (p<0.05), reject the null hypothesis H_0, indicating at least one group mean differs significantly from the others.
- If the p-value is greater than or equal to 0.05, fail to reject the null hypothesis, indicating no significant difference among the group means.
- NOTE: If the null hypothesis is rejected, conduct a post-hoc test to determine which group is significantly different than the others and by how much. Tukey’s HSD is an example of a post-hoc test.
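The whole sequence (F-test, then Tukey’s HSD if significant) in base R, comparing mean mpg across the three cylinder groups in mtcars:

```r
mtcars$cyl <- factor(mtcars$cyl)  # grouping variable must be a factor

fit <- aov(mpg ~ cyl, data = mtcars)
summary(fit)   # F-test; p < 0.05 means at least one group mean differs

TukeyHSD(fit)  # post-hoc: which pairs of groups differ, and by how much
```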
Associative Analysis
- Determines whether stable relationships exist between two variables.
Relationship between Two Variables
- A relationship is a consistent, systematic linkage between the levels or labels of two variables.
- Levels – Refer to the characteristics used to describe interval or ratio scales. (e.g., older consumers purchase more vitamins).
- Labels – Refer to the characteristics used to describe nominal or ordinal scales. (e.g., PayPal customers (Yes/No) tend to also be Amazon Prime customers).
Types of Linkage between Variables:
- Statistical Linkage – Indicates a consistent pattern or association between variables, but not causation (e.g., people who exercise daily tend to purchase sports drinks, which suggests a correlation but does not prove that exercising causes sports drink purchases).
- Causal Linkage – Requires certainty that one variable causes the other (e.g., increased advertising spending leads to higher sales).
Monotonic Linear Relationship
- A monotonic linear relationship is a straight-line association between two scale variables.
- Formula: y = a + bx
- Example: Burger King estimates that every customer will spend about $12 per lunch.
- y = 0 + 12x, where x is the number of customers.
- 100 customers: 12 × 100 = $1,200 in revenue.
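The straight-line formula as a tiny R function (the function name is illustrative):

```r
# y = a + bx with a = 0 and b = 12 (dollars per customer)
revenue <- function(customers) 0 + 12 * customers

revenue(100)  # 1200
```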
Nonmonotonic Relationship
- The presence or absence of one variable’s label is systematically associated with the presence or absence of another variable’s label.
- Example: the type of drink a customer orders in the morning (coffee) and at noon (soft drink).
- NOTE: Nonmonotonic relationships are often used for the analysis of nominal scale variables.
Characterizing Relationship between Variables
- Presence: Whether a systematic (statistical) relationship exists between two variables.
- Pattern: The general nature of the relationship, including its direction.
- Strength of Association: Whether the relationship is consistent.
Step-By-Step procedure for analyzing the relationship between two variables
- Step 1 – Choose variables to analyze: Identify two variables that might be related.
- Step 2 – Determine the scaling assumption of the chosen variable: Either nominal, ordinal, interval, or ratio.
- Step 3 – Use the correct relationship analysis: If the two scales are interval or ratio, use correlation analysis; if the two scales are nominal, use cross-tabulation.
- Step 4 – Determine if the relationship is present: If the analysis is significant (typically at the 95% level), the relationship is present.
- Step 5 – If present, determine the direction of the relationship
- Step 6 – If present, assess the strength of the relationship: With correlation, the size of the coefficient denotes strength; with cross-tabulation, use Cramer’s V.
Correlation Coefficients
- Quantifies the strength and direction of a linear relationship between two scale variables, ranging from -1.0 to +1.0.
- Negative association – x increases, y decreases
- Positive association – x increases, y increases
Pearson Product Moment Correlation Coefficient
- Measures the linear relationship between two interval or ratio-scaled variables.
- Psych package is used during calculation.
corr.test() function in the psych package computes both the correlation coefficient and the p-value in a single step.
results$stars displays the correlation matrix with the coefficients marked by significance stars (p-values).
- NOTE: 2 stars or 3 stars often means strong statistical significance.
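psych::corr.test() handles whole correlation matrices; for a single pair of variables, base R’s cor.test() returns the same coefficient and p-value, so a sketch that avoids the extra package:

```r
# Pearson correlation between car weight and mpg (built-in mtcars data)
res <- cor.test(mtcars$wt, mtcars$mpg)

res$estimate  # about -0.87: strong negative association
res$p.value   # well below 0.05: the relationship is present
```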
Cross Tabulation Analysis
- A statistical method used to examine nonmonotonic relationships between two nominally scaled variables.
- A cross-tabulation table is sometimes referred to as an “rxc” table because it consists of rows, columns, and cells.
Chi-Square Analysis
- A statistical analysis technique used to examine the frequencies of two nominally scaled variables in a cross-tabulation table.
Degrees of Freedom in Chi-Square Analysis
- Represent the number of values in a calculation that are free to vary while still meeting the constraints of the data.
Characteristics of the Chi-Square Distribution
- The Chi-Square Distribution is skewed to the right which means it is not symmetrical.
- The skewness decreases as the degrees of freedom increases.
- For large degrees of freedom, the Chi-Square distribution approximates a normal distribution.
- The rejection region is always located in the right-hand tail of the distribution.
- Larger Chi-Square values fall in the rejection region, indicating a significant result.
tabyl() function within the janitor package is used for cross-tabulation.
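janitor’s tabyl() produces nicely formatted tables; the same cross-tabulation and chi-square test can be sketched in base R (cyl and am are columns of the built-in mtcars data, used here as nominal variables):

```r
# r x c cross-tabulation: cylinders (rows) by transmission type (columns)
tab <- table(mtcars$cyl, mtcars$am)
tab

# Chi-square test of the frequencies; small p-value suggests association
chisq.test(tab)
```

Note: with a table this small, R warns that some expected counts are below 5; larger samples avoid that.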
Regression Analysis
- A technique where one or more variables are used to predict the level of another using the straight-line formula.
Bivariate/Simple Regression
- When two variables are involved.
- Treats only one dependent and independent variable at a time.
Multiple Regression Analysis
- Allows simultaneous analysis of multiple independent variables.
- It is more efficient and comprehensive for complex problems.
- Multiple Regression equations have two or more independent variables.
Understanding R squared
- Also referred to as “Coefficient of Determination,” represents the proportion of the variance in the dependent variable that is explained by the independent variable.
- Explains how well the regression model fits the data.
- R square ranges from 0 to 1
- 0: None of the variations is explained by the model
- 1: All variations are perfectly explained by the model
- For example, an R square of 0.75 means 75% of the variability in the dependent variable is explained by the independent variable. The remaining 25% remains unexplained.
- NOTE: Higher R-squared values indicate a good fit, while lower R-squared values indicate a poor fit.
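Reading R-squared off a fitted model in R, using a simple regression of mpg on weight from the built-in mtcars data:

```r
model <- lm(mpg ~ wt, data = mtcars)

summary(model)$r.squared  # about 0.75: weight explains ~75% of the variance in mpg
```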
Multicollinearity
- Occurs when independent variables are moderately or strongly correlated.
- Results in invalid multiple regression findings.
Detecting Multicollinearity
- One common method to detect multicollinearity is Variance Inflation Factor (VIF).
VIF
- A measure of how much the variance of a regression coefficient is inflated (increased) due to multicollinearity.
- VIF less than 10 (<10): No significant multicollinearity.
- VIF greater than 10 (>10): Problematic multicollinearity; variable must be reconsidered or removed from the model.
lm() function performs the linear regression model.
vif() function in the car package is used to check for multicollinearity.
vif(full_model) will show VIF values for each variable.
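car::vif() computes the VIF for every predictor in one call; the calculation it performs can be sketched by hand in base R, which also makes the definition concrete (regress each predictor on the others, then VIF = 1 / (1 − R²)):

```r
full_model <- lm(mpg ~ wt + disp + hp, data = mtcars)

# VIF for wt by hand: regress it on the other predictors
r2_wt  <- summary(lm(wt ~ disp + hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

vif_wt  # well under 10 here, so no problematic multicollinearity for wt
```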
“Trimming” the Regression for Significant Findings
- Systematically eliminating insignificant independent variables from a regression model, resulting in a simplified model with only statistically significant predictors.
- Removing variables with p-value higher than 0.05 (p>0.05).
Why Trim?
- Reduces the risk of overfitting (Too complex).
- Improves model interpretability.
- Ensures the regression focuses only on meaningful relationships.
Trimming Process
- Run Initial Model (Includes all independent variables).
- Identify Insignificant Variables (Check p-value for each variable; insignificant if (p>0.05)).
- Eliminate Insignificant Variables (Remove one insignificant variable at a time, starting with the least significant variable).
- Re-run the model (re-check for any other insignificant variable).
- Stop when all remaining variables have a (p<0.05). (Significant variables).
- The step() function is used to assist with the trimming process in R.
- The step() function performs stepwise model selection by iteratively adding or removing predictors to optimize a model’s goodness of fit, based on criteria like AIC or BIC.
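A sketch of step() trimming a full model down (mtcars again; starting from a full model with no wider scope, step() works backward, dropping predictors while the AIC improves):

```r
full_model <- lm(mpg ~ wt + disp + hp + drat, data = mtcars)

# trace = 0 suppresses the step-by-step printout; lower AIC is better
trimmed <- step(full_model, trace = 0)

summary(trimmed)  # keeps only the predictors that improve the AIC
```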