Descriptive Analytics in Information Systems and Supply Chain Management
INFORMATION SYSTEMS AND SUPPLY CHAIN MANAGEMENT
WEEK 5: DESCRIPTIVE ANALYTICS
Chapter 4
Course Drawn from Loyola University Chicago
4.1 Purpose of Descriptive Statistics
Descriptive statistics are used to summarize and describe the characteristics of a dataset.
Central Tendency Measures: Indicate where data points are centered.
Mean: The average value.
Calculation:
Mean is computed as $\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of observations}}$.
Limitation: Sensitive to extreme values (outliers).
Median: The middle value in a sorted dataset (the second quartile, Q2).
Property: Robust to outliers; making an extreme value even more extreme does not change the median, because the value stays on the same side of the middle.
Mode: The most frequently occurring value in the dataset.
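The contrast between the mean and the median can be sketched in R. The price values below are hypothetical and chosen only for illustration; they are not from the course data.

```r
# Hypothetical house prices in thousands of USD (illustrative values only)
prices <- c(200, 210, 210, 230, 250)
with_outlier <- c(prices, 900)      # same data plus one extreme value

mean_before <- mean(prices)         # 220
median_before <- median(prices)     # 210

mean_after <- mean(with_outlier)    # about 333.3: pulled up by the outlier
median_after <- median(with_outlier)  # 220: moves only slightly
```

One extreme value shifts the mean by more than 100, while the median moves by 10.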
Measures of Dispersion
Dispersion Measures: Indicate the spread of data points.
Range: Difference between the maximum and minimum values in the dataset.
Calculation: $\text{Range} = \text{Max} - \text{Min}$.
Interquartile Range (IQR): Difference between the third quartile (Q3) and the first quartile (Q1); measures the spread of the middle 50% of the data.
Calculation: $\text{IQR} = Q3 - Q1$.
Standard Deviation: Indicates how spread out data points are around the mean.
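As a minimal sketch, the three dispersion measures can be computed in base R as follows. The vector x is a made-up example; IQR() and sd() are base R functions, and sd() returns the sample standard deviation (denominator n - 1).

```r
# Hypothetical observations (illustrative values only)
x <- c(2, 4, 4, 5, 7, 9, 11)

value_range <- max(x) - min(x)  # 11 - 2 = 9
iqr_value <- IQR(x)             # Q3 - Q1 with R's default quantile method: 8 - 4 = 4
std_dev <- sd(x)                # sample standard deviation: sqrt(10)
```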
Additional Descriptive Aspects
Frequency Distribution: Shows how often each value occurs in the dataset.
Skewness: Measures the asymmetry of the distribution of data.
Percentiles: Values below which a given percentage of observations fall.
Shape: The overall distribution pattern observed in the data.
4.2 Commands in R
General Statistics Commands
Common R commands for inspecting a data frame:
nrow(): Provides total number of rows in a data frame.
Example:
nrow(real_estate) returns the number of rows in the real_estate data frame.
ncol(): Provides total number of columns in a data frame.
Example:
ncol(real_estate) returns the number of columns in the real_estate data frame.
Head and Tail Functions
head(): Displays the first n observations of a data frame.
Syntax:
head(<data frame object>, n=<number of rows to show>)
tail(): Displays the last n observations in a data frame.
Syntax:
tail(<data frame object>, n=<number of rows to show>)
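The inspection commands nrow(), ncol(), head(), and tail() can be tried on a small stand-in data frame. The real_estate contents below are invented for illustration; the course data set will have different columns and values.

```r
# Hypothetical stand-in for the course's real_estate data frame
real_estate <- data.frame(
  Location = c("Chicago", "Evanston", "Chicago", "Oak Park",
               "Evanston", "Chicago", "Oak Park", "Chicago"),
  PriceUSD = c(250000, 310000, 275000, 420000,
               295000, 260000, 405000, 280000),
  stringsAsFactors = FALSE
)

n_rows <- nrow(real_estate)              # 8
n_cols <- ncol(real_estate)              # 2

first_three <- head(real_estate, n = 3)  # rows 1 to 3
last_two <- tail(real_estate, n = 2)     # rows 7 and 8
```

If n is omitted, head() and tail() show six rows by default.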
Unique Values Command
unique(): Returns distinct values from a specified column.
Example:
unique(real_estate$Location)
Can also be applied to a whole data frame to create a new data frame without duplicate rows:
Syntax:
<new data frame object name> = unique(<data frame object>)
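Both uses of unique() can be sketched on a small example. The data frame below is hypothetical; its fourth row deliberately repeats the first.

```r
# Hypothetical data; row 4 is an exact duplicate of row 1
real_estate <- data.frame(
  Location = c("Chicago", "Evanston", "Chicago", "Chicago"),
  PriceUSD = c(250000, 310000, 275000, 250000),
  stringsAsFactors = FALSE
)

# Distinct values of one column, in order of first appearance
distinct_locations <- unique(real_estate$Location)

# Whole data frame with duplicate rows removed (3 rows remain)
real_estate_no_dupes <- unique(real_estate)
```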
Summary Statistics
summary(): Returns essential summary statistics:
Outputs include Minimum (Min.), First Quartile (1st Qu.), Median (Q2), Mean, Third Quartile (3rd Qu.), and Maximum (Max.).
Excluding the mean, these values make up the 5-number summary, which is useful for boxplots and skewness analysis.
Syntax:
summary(<data frame object>)
To get summary for a specific column:
Syntax:
summary(<data frame name>$<column name>)
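A short sketch of summary() on a single numerical vector, using hypothetical prices. Comparing the mean with the median gives a rough hint about skewness: a mean well above the median suggests a right-skewed distribution.

```r
# Hypothetical prices in thousands of USD (illustrative values only)
prices <- c(200, 210, 210, 230, 250, 900)

price_summary <- summary(prices)
print(price_summary)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

# Individual statistics can be extracted by name
median_price <- price_summary[["Median"]]
mean_price <- price_summary[["Mean"]]
```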
Grouped Summary with 'by()'
by(): Creates summary statistics grouped by a specific column.
Syntax:
by(<data frame object>, <data frame object>$<column>, summary)
Example:
by(real_estate, real_estate$Location, summary) gives summaries for each location.
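As an illustration, by() can be run on a small hypothetical data frame; the column names and values here are assumptions, not the course data.

```r
# Hypothetical stand-in for the real_estate data frame
real_estate <- data.frame(
  Location = c("Chicago", "Evanston", "Chicago", "Evanston"),
  PriceUSD = c(250000, 310000, 270000, 330000),
  stringsAsFactors = FALSE
)

# One summary() result per distinct Location
grouped <- by(real_estate, real_estate$Location, summary)
print(grouped)
```

The result holds one element per group, here one for Chicago and one for Evanston.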
Central Tendency
To calculate central tendency (Mean, Median, Mode):
Mean Calculation with NA Handling:
mean(<data frame object>$<column>) returns NA if missing values exist.
mean(<data frame object>$<column>, na.rm=TRUE) excludes NA values from the calculation.
Median Calculation:
Syntax:
median(<data frame object>$<column>)
Mode Calculation: Base R has no built-in function for the statistical mode (mode() reports an object's storage type instead), so it is typically found by counting values, for example with table().
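The NA handling and the manual mode calculation can be sketched as follows. The helper stat_mode() is a hypothetical name introduced here, not a base R function, and the prices are invented.

```r
# Hypothetical prices with one missing value
prices <- c(200, NA, 210, 210, 250)

mean_with_na <- mean(prices)                    # NA, because of the missing value
mean_clean <- mean(prices, na.rm = TRUE)        # 217.5
median_clean <- median(prices, na.rm = TRUE)    # 210

# Hypothetical helper: statistical mode via a frequency count.
# table() ignores NA by default; all tied values are returned, as text.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

mode_value <- stat_mode(prices)                 # "210"
```

Wrap the result in as.numeric() if a number rather than a character string is needed.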
Dispersion Measures in R
Range:
Calculation: diff(range(<data frame object>$<column>)) or max(<data frame object>$<column>) - min(<data frame object>$<column>).
Frequency Distribution Table
Frequency Distribution with 'table()':
Syntax:
table(<data frame object>$<column>)
Example: For Locations:
table(real_estate$Location)
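A minimal frequency-table sketch on hypothetical locations. prop.table() is a base R function, added here as an optional extra to convert counts into proportions.

```r
# Hypothetical stand-in for the Location column
real_estate <- data.frame(
  Location = c("Chicago", "Evanston", "Chicago",
               "Oak Park", "Chicago", "Evanston"),
  stringsAsFactors = FALSE
)

location_counts <- table(real_estate$Location)
print(location_counts)       # Chicago 3, Evanston 2, Oak Park 1

# Relative frequencies (shares that sum to 1)
location_shares <- prop.table(location_counts)
```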
Tabulating Statistical Measures by Category
tapply(): Tabulates statistics based on categories from another column.
Syntax:
tapply(<data frame object>$<numerical column>, <data frame object>$<categorical column>, <statistical measure>)
Example:
tapply(real_estate$Price, real_estate$Location, mean) averages prices by location.
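The tapply() call can be illustrated with a small hypothetical data frame; any function that takes a numeric vector, such as median or max, can replace mean.

```r
# Hypothetical stand-in for the real_estate data frame
real_estate <- data.frame(
  Location = c("Chicago", "Evanston", "Chicago", "Evanston"),
  Price = c(250000, 310000, 270000, 330000),
  stringsAsFactors = FALSE
)

# Mean price per location: Chicago 260000, Evanston 320000
mean_price <- tapply(real_estate$Price, real_estate$Location, mean)

# Same grouping with a different measure
max_price <- tapply(real_estate$Price, real_estate$Location, max)
```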
Using Measures to Replace Values
Example: Modify an entry dynamically:
real_estate[7,3] = mean(real_estate$PriceUSD)
This substitutes the value in row 7, column 3 with the calculated mean of 'PriceUSD'.
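One common use of this pattern is filling a missing value with the column mean. The sketch below assumes a hypothetical three-row data frame and locates the missing entry with is.na() rather than by a fixed row number; na.rm = TRUE is needed because the mean of a column containing NA is itself NA.

```r
# Hypothetical data with one missing price
real_estate <- data.frame(
  Location = c("Chicago", "Evanston", "Oak Park"),
  Bedrooms = c(2, 3, 4),
  PriceUSD = c(250000, NA, 350000),
  stringsAsFactors = FALSE
)

# Rows where PriceUSD is missing
missing_price <- is.na(real_estate$PriceUSD)

# Replace the missing entries with the mean of the observed prices (300000)
real_estate[missing_price, "PriceUSD"] <- mean(real_estate$PriceUSD, na.rm = TRUE)
```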
4.3 Basic Statistical Visualizations
Plotting Commands in R
Barplot: Represents frequencies of a categorical variable.
Syntax:
barplot(table(<data frame object>$<categorical column>), col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>")
Histograms
Histogram: Displays frequency of numerical variables within range classes.
Classes on the x-axis, frequencies on the y-axis.
Syntax:
hist(<data frame object>$<numerical column>, col = "<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>")
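A possible histogram call on hypothetical prices is sketched below. The color, title, and labels are arbitrary choices; hist() also returns an object whose breaks and counts show how R formed the classes.

```r
# Hypothetical prices in thousands of USD (illustrative values only)
prices <- c(200, 210, 210, 230, 250, 260, 300, 320, 410, 900)

h <- hist(prices,
          col = "steelblue",
          main = "Distribution of Prices",
          xlab = "Price (thousand USD)",
          ylab = "Frequency")

class_breaks <- h$breaks   # boundaries of the range classes on the x-axis
class_counts <- h$counts   # number of observations in each class
```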
Boxplots
Boxplot: Visual representation based on the 5-number summary.
Displays minima, maxima, quartiles, and outliers.
Syntax for one numerical variable:
boxplot(<data frame object>$<numerical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>")
Syntax for categorical group comparison:
boxplot(<data frame object>$<numerical column> ~ <data frame object>$<categorical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>", horizontal = TRUE)
Understanding Quartiles and IQR
Quartiles: Divide data set into four equal parts.
IQR: $\text{IQR} = Q3 - Q1$
Outliers defined as values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
Extreme outliers: values below Q1 - 3(IQR) or above Q3 + 3(IQR).
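The outlier rules above can be worked through in R as a sketch. The prices are hypothetical, and the quartiles come from quantile() with its default method; boxplot() computes its box edges slightly differently, so its flagged points can occasionally differ.

```r
# Hypothetical prices in thousands of USD (illustrative values only)
prices <- c(200, 210, 210, 230, 250, 260, 300, 320, 410, 900)

q1 <- quantile(prices, 0.25, names = FALSE)   # 215
q3 <- quantile(prices, 0.75, names = FALSE)   # 315
iqr <- q3 - q1                                # 100

# Outlier fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
lower_fence <- q1 - 1.5 * iqr                 # 65
upper_fence <- q3 + 1.5 * iqr                 # 465
outliers <- prices[prices < lower_fence | prices > upper_fence]

# Extreme outlier fences: Q1 - 3(IQR) and Q3 + 3(IQR)
extreme <- prices[prices < q1 - 3 * iqr | prices > q3 + 3 * iqr]
```

Here 900 lies above both the 465 fence and the 615 extreme fence, so it is an extreme outlier.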
Scatterplots
Definition: Used for visualizing relationships between two numerical variables.
Hypothesized independent variable on x-axis, dependent variable on y-axis.
Correlation does not imply causation.
Syntax:
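plot(<data frame object>$<numerical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab = "<y-axis label>")
With a single numerical column, plot() draws each value against its row index; to relate two variables, pass two columns to plot().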
Correlation in Scatterplots
Positive correlation: Dots trend upwards.
Higher independent values correspond with higher dependent values.
Negative correlation: Dots trend downwards.
Higher independent values correspond with lower dependent values.
No relationship: Dots display no clear trend.
Plotting Two Numerical Variables
Syntax:
plot(<data frame object>$<numerical column>, <data frame object>$<numerical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab = "<y-axis label>")
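A two-variable scatterplot can be sketched as below. The size and price values are invented, and the cor() call is an addition not covered in these notes: it returns the correlation coefficient, which puts a number on the upward or downward trend seen in the plot.

```r
# Hypothetical data: home size (independent) and price (dependent)
size_sqft <- c(800, 1000, 1200, 1500, 1800, 2200)
price_usd <- c(200000, 240000, 275000, 330000, 390000, 450000)

# Independent variable on the x-axis, dependent variable on the y-axis
plot(size_sqft, price_usd,
     col = "darkgreen",
     main = "Price vs. Size",
     xlab = "Size (sq ft)",
     ylab = "Price (USD)")

# Correlation coefficient: positive here, since the dots trend upwards
r <- cor(size_sqft, price_usd)
```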