Descriptive Analytics in Information Systems and Supply Chain Management

INFORMATION SYSTEMS AND SUPPLY CHAIN MANAGEMENT

WEEK 5: DESCRIPTIVE ANALYTICS

Chapter 4
  • Course Drawn from Loyola University Chicago


4.1 Purpose of Descriptive Statistics

  • Descriptive statistics are used to summarize and describe the characteristics of a dataset.

    • Central Tendency Measures: Indicate where data points are centered.

    • Mean: The average value.

      • Calculation:

      • Mean is computed as $\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of observations}}$.

      • Limitation: Sensitive to extreme values (outliers).

    • Median: The middle value in a sorted dataset. (Second quartile or Q2)

      • Property: Robust to outliers; making an extreme value even more extreme does not change it.

    • Mode: The most frequently occurring value in the dataset.
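
The difference in outlier sensitivity between the mean and the median can be checked on a small vector. The following is a minimal sketch in R; the vectors x and y are made-up values chosen for illustration, not course data.

```r
# Hypothetical values; the last one is an extreme outlier.
x <- c(2, 3, 3, 5, 100)

mean_x <- mean(x)       # 113 / 5 = 22.6, pulled upward by the outlier
median_x <- median(x)   # middle of the sorted values: 3

# Make the outlier even more extreme: the mean moves, the median does not.
y <- c(2, 3, 3, 5, 1000)
mean_y <- mean(y)       # 1013 / 5 = 202.6
median_y <- median(y)   # still 3

print(c(mean_x = mean_x, median_x = median_x))
print(c(mean_y = mean_y, median_y = median_y))
```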

Measures of Dispersion
  • Dispersion Measures: Indicate the spread of data points.

    • Range: Difference between the maximum and minimum values in the dataset.

    • Calculation: $\text{Range} = \text{Max} - \text{Min}$.

    • Interquartile Range (IQR): Difference between the third quartile (Q3) and the first quartile (Q1); measures the spread of the middle 50% of the data.

    • Calculation: $\text{IQR} = Q3 - Q1$.

    • Standard Deviation: Indicates how spread out data points are around the mean.
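
The three dispersion measures above can be computed by hand on a small vector. This is a sketch under stated assumptions: the values in x are made up, the quartiles come from quantile() with R's default method (other quartile conventions give slightly different results), and the standard deviation shown is the sample version that divides by n - 1, which is what R's sd() computes.

```r
# Hypothetical values for illustration.
x <- c(2, 4, 4, 5, 7, 9)

range_x <- max(x) - min(x)                 # 9 - 2 = 7

q <- quantile(x, probs = c(0.25, 0.75))    # Q1 and Q3, default method
iqr_x <- unname(q[2] - q[1])               # Q3 - Q1

# Sample standard deviation by hand, then with the built-in sd().
sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sd_builtin <- sd(x)

print(c(range = range_x, IQR = iqr_x, sd = sd_builtin))
```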

Additional Descriptive Aspects
  • Frequency Distribution: Shows how often each value occurs in the dataset.

  • Skewness: Measures the asymmetry of the distribution of data.

  • Percentiles: Values below which a given percentage of observations fall.

  • Shape: The overall distribution pattern observed in the data.
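
Percentiles can be illustrated with quantile(), a base R function that is not covered in these notes. The sketch below uses made-up values; comparing the mean with the median is only an informal rule of thumb for spotting skew, not a formal skewness measure.

```r
# Hypothetical values with one large observation on the right.
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

# 90th percentile: the value below which about 90% of observations fall.
p90 <- quantile(x, probs = 0.90)

# The quartiles are the 25th, 50th, and 75th percentiles.
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))

# A mean above the median often hints at a right-skewed distribution.
mean_above_median <- mean(x) > median(x)

print(p90)
print(quartiles)
print(mean_above_median)
```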


4.2 Commands in R

General Statistics Commands
  • Basic R commands for inspecting a data frame:

    • nrow(): Provides total number of rows in a data frame.

    • Example: nrow(real_estate) returns the number of rows in the real_estate data frame.

    • ncol(): Provides total number of columns in a data frame.

    • Example: ncol(real_estate) returns the number of columns in the real_estate data frame.

Head and Tail Functions
  • head(): Displays the first n observations of a data frame.

    • Syntax: head(<data frame object>, n=<number of rows to show>)

  • tail(): Displays the last n observations in a data frame.

    • Syntax: tail(<data frame object>, n=<number of rows to show>)
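
The inspection commands above can be tried on a small data frame. In this sketch the contents of real_estate (the columns Location, Bedrooms, and Price and all their values) are hypothetical stand-ins for the course data set.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North"),
  Bedrooms = c(3, 2, 4, 3, 2, 5),
  Price    = c(250, 180, 320, 210, 190, 400)  # assumed: thousands of USD
)

n_rows <- nrow(real_estate)   # number of observations
n_cols <- ncol(real_estate)   # number of variables

first_two <- head(real_estate, n = 2)   # first 2 rows
last_two  <- tail(real_estate, n = 2)   # last 2 rows

print(first_two)
print(last_two)
```

If n is omitted, head() and tail() show 6 rows by default.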

Unique Values Command
  • unique(): Returns distinct values from a specified column.

    • Example: unique(real_estate$Location)

    • Can be used to create a new data frame without duplicates:

    • Syntax: <new data frame object name> = unique(<data frame object>)
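
Both uses of unique() are sketched below. The data frame contents are hypothetical; rows 1 and 3 are deliberately identical so that the de-duplication has something to remove.

```r
# Hypothetical data; rows 1 and 3 are exact duplicates.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "North"),
  Price    = c(250, 180, 250, 210, 320)
)

# Distinct values of one column, in order of first appearance.
locations <- unique(real_estate$Location)

# New data frame with duplicate rows removed.
real_estate_unique <- unique(real_estate)

print(locations)
print(real_estate_unique)
```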

Summary Statistics
  • summary(): Returns essential summary statistics:

    • Outputs include Minimum (Min.), First Quartile (1st Qu.), Median, Mean, Third Quartile (3rd Qu.), and Maximum (Max.).

    • The minimum, Q1, median, Q3, and maximum together form the 5-number summary (the mean is reported in addition), useful for boxplots and skewness analysis.

    • Syntax: summary(<data frame object>)

  • To get summary for a specific column:

    • Syntax: summary(<data frame name>$<column name>)
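
A short sketch of summary() on a whole data frame and on a single column follows. The real_estate values are made up for illustration.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "North"),
  Price    = c(250, 180, 320, 210, 400)  # assumed: thousands of USD
)

# Summary of every column (numeric columns get the six statistics).
print(summary(real_estate))

# Summary of one column only.
price_summary <- summary(real_estate$Price)
print(price_summary)
```

For a numeric column that contains missing values, summary() also reports the number of NAs.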

Grouped Summary with 'by()'
  • by(): Creates summary statistics grouped by a specific column.

    • Syntax: by(<data frame object>, <data frame object>$<column>, summary)

    • Example: by(real_estate, real_estate$Location, summary) gives summaries for each location.
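
The grouped summary can be sketched as follows, again with a hypothetical real_estate data frame. Restricting the first argument to the numeric columns is a choice made here to keep the output short; it is not required by by().

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North"),
  Bedrooms = c(3, 2, 4, 3, 2, 5),
  Price    = c(250, 180, 320, 210, 190, 400)
)

# One summary() table per distinct Location.
by_location <- by(real_estate[, c("Bedrooms", "Price")],
                  real_estate$Location,
                  summary)

print(by_location)
```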

Central Tendency
  • To calculate central tendency (Mean, Median, Mode):

    • Mean Calculation with NA Handling:

    • mean(<data frame object>$<column>) returns NA if missing values exist.

    • mean(<data frame object>$<column>, na.rm=TRUE) excludes NA values from the calculation.

    • Median Calculation:

    • Syntax: median(<data frame object>$<column>)

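    • Mode: R has no built-in function for the statistical mode (mode() returns an object's storage type instead); it is typically found by counting values, e.g., with table().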
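
The three central tendency measures, including NA handling, can be sketched as below. The prices vector is made up, and the table()-based mode calculation is one common approach rather than an official R function.

```r
# Hypothetical prices with one missing value.
prices <- c(250, 180, NA, 210, 250)

mean_with_na <- mean(prices)                 # NA, because of the missing value
mean_clean   <- mean(prices, na.rm = TRUE)   # (250 + 180 + 210 + 250) / 4

# median() also returns NA unless missing values are removed.
median_clean <- median(prices, na.rm = TRUE)

# Mode: count each value (table() drops NA by default), keep the most frequent.
counts <- table(prices)
mode_values <- as.numeric(names(counts)[counts == max(counts)])

print(c(mean = mean_clean, median = median_clean))
print(mode_values)
```

If several values tie for the highest count, mode_values contains all of them.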

Dispersion Measures in R
  • Range:

    • Calculation: diff(range(<data frame object>$<column>)) or max(<data frame object>$<column>) - min(<data frame object>$<column>).
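
The range calculation is sketched below, together with the base R functions IQR() and sd() for the other two dispersion measures defined in Section 4.1. IQR() and sd() are not listed in these notes, so treat their inclusion as a supplement; the data values are hypothetical.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Price = c(250, 180, 320, 210, NA, 400)
)

# Range, two equivalent ways; na.rm = TRUE skips the missing value.
range_a <- diff(range(real_estate$Price, na.rm = TRUE))
range_b <- max(real_estate$Price, na.rm = TRUE) -
  min(real_estate$Price, na.rm = TRUE)

# Interquartile range and sample standard deviation.
iqr_price <- IQR(real_estate$Price, na.rm = TRUE)
sd_price  <- sd(real_estate$Price, na.rm = TRUE)

print(c(range = range_a, IQR = iqr_price, sd = sd_price))
```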

Frequency Distribution Table
  • Frequency Distribution with 'table()':

    • Syntax: table(<data frame object>$<column>)

    • Example: For Locations: table(real_estate$Location).
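
A minimal sketch of a frequency table, using a hypothetical real_estate data frame:

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North")
)

# Count how many rows fall in each Location.
freq <- table(real_estate$Location)
print(freq)
```

The categories are listed in alphabetical order, each with its count.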

Tabulating Statistical Measures by Category
  • tapply(): Applies a statistical function to a numerical column, grouped by the categories of another column.

    • Syntax: tapply(<data frame object>$<numerical column>, <data frame object>$<categorical column>, <statistical measure>)

    • Example: tapply(real_estate$Price, real_estate$Location, mean) averages prices by location.
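
The sketch below applies tapply() to a hypothetical real_estate data frame that contains one missing price. Extra arguments such as na.rm = TRUE are passed on to the statistical function.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North"),
  Price    = c(250, 180, 320, 210, NA, 400)
)

# Mean price per location; South is NA because one of its prices is missing.
avg_raw <- tapply(real_estate$Price, real_estate$Location, mean)

# Same calculation, ignoring missing prices.
avg_clean <- tapply(real_estate$Price, real_estate$Location, mean, na.rm = TRUE)

print(avg_raw)
print(avg_clean)
```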

Using Measures to Replace Values
  • Example: Modify an entry dynamically:

    • real_estate[7,3] = mean(real_estate$PriceUSD)

    • This replaces the value in row 7, column 3 with the mean of the 'PriceUSD' column. If that column contains missing values, add na.rm=TRUE so that the mean is not NA.
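
A common use of this pattern is filling in a missing value. The sketch below assumes a hypothetical real_estate data frame in which PriceUSD is the third column and the entry in row 7 is missing.

```r
# Hypothetical data: PriceUSD is column 3 and row 7 has a missing price.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North", "East"),
  Bedrooms = c(3, 2, 4, 3, 2, 5, 3),
  PriceUSD = c(250000, 180000, 320000, 210000, 190000, 400000, NA)
)

# Mean of the observed prices; na.rm = TRUE is needed because of the NA.
price_mean <- mean(real_estate$PriceUSD, na.rm = TRUE)

# Replace the entry in row 7, column 3 with that mean.
real_estate[7, 3] <- price_mean

print(real_estate[7, ])
```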


4.3 Basic Statistical Visualizations

Plotting Commands in R
  • Barplot: Represents frequencies of a categorical variable.

    • Syntax: barplot(table(<data frame object>$<categorical column>), col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>")
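
A minimal barplot sketch follows. The data, color, title, and axis labels are illustrative choices.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North")
)

# Bar heights are the category counts produced by table().
freq <- table(real_estate$Location)
bar_midpoints <- barplot(freq,
                         col  = "steelblue",
                         main = "Listings by Location",
                         xlab = "Location",
                         ylab = "Number of listings")
```

barplot() returns the x-positions of the bar centers, which is why its result can be stored in a variable.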

Histograms
  • Histogram: Displays how often the values of a numerical variable fall within each class (bin).

    • Classes on the x-axis, frequencies on the y-axis.

    • Syntax: hist(<data frame object>$<numerical column>, col = "<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>")
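
A minimal histogram sketch with hypothetical prices follows. R chooses the class boundaries automatically here.

```r
# Hypothetical prices (assumed: thousands of USD).
real_estate <- data.frame(
  Price = c(180, 190, 210, 250, 260, 320, 400)
)

price_hist <- hist(real_estate$Price,
                   col  = "lightgray",
                   main = "Distribution of Prices",
                   xlab = "Price",
                   ylab = "Frequency")

# The returned object holds the class boundaries and the count per class.
print(price_hist$breaks)
print(price_hist$counts)
```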

Boxplots
  • Boxplot: Visual representation based on the 5-number summary.

    • Displays the minimum, maximum, quartiles, and outliers.

    • Syntax for one numerical variable:

    • boxplot(<data frame object>$<numerical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>")

    • Syntax for categorical group comparison:

    • boxplot(<data frame object>$<numerical column> ~ <data frame object>$<categorical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab= "<y-axis label>", horizontal = TRUE)
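
Both boxplot forms are sketched below on a hypothetical real_estate data frame. The colors, titles, and labels are illustrative.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Location = c("North", "South", "North", "East", "South", "North"),
  Price    = c(250, 180, 320, 210, 190, 400)
)

# One numerical variable.
boxplot(real_estate$Price,
        col  = "lightblue",
        main = "Prices",
        ylab = "Price")

# Numerical variable split by category, drawn horizontally.
grouped <- boxplot(real_estate$Price ~ real_estate$Location,
                   col  = "lightgreen",
                   main = "Prices by Location",
                   xlab = "Price",
                   ylab = "Location",
                   horizontal = TRUE)

# The returned list includes the group names and group sizes.
print(grouped$names)
print(grouped$n)
```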

Understanding Quartiles and IQR
  • Quartiles: Divide a sorted dataset into four equal parts.

  • IQR: $\text{IQR} = Q3 - Q1$

  • Outliers defined as values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).

    • Extreme outliers: values below Q1 - 3(IQR) or above Q3 + 3(IQR).
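
The outlier rules above can be applied directly. The sketch below uses made-up values and takes the quartiles from quantile() with R's default method; this is an assumption, since other quartile conventions (including the one boxplot() uses internally) can place the fences slightly differently.

```r
# Hypothetical values; 40 is far from the rest.
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences for outliers (1.5 * IQR) and extreme outliers (3 * IQR).
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
lower_extreme <- q1 - 3 * iqr
upper_extreme <- q3 + 3 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
extreme_outliers <- x[x < lower_extreme | x > upper_extreme]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
print(extreme_outliers)
```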

Scatterplots
  • Definition: Used for visualizing relationships between two numerical variables.

    • Hypothesized independent variable on x-axis, dependent variable on y-axis.

    • Correlation does not imply causation.

  • Syntax for a single numerical variable (values are plotted against their row index): plot(<data frame object>$<numerical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab = "<y-axis label>")

Correlation in Scatterplots
  • Positive correlation: Dots trend upwards.

    • Higher independent values correspond with higher dependent values.

  • Negative correlation: Dots trend downwards.

    • Higher independent values correspond with lower dependent values.

  • No relationship: Dots display no clear trend.

Plotting Two Numerical Variables
  • Syntax: plot(<data frame object>$<numerical column>, <data frame object>$<numerical column>, col ="<insert color name>", main="<title of plot>", xlab = "<x-axis label>", ylab = "<y-axis label>")
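
A minimal scatterplot sketch follows, with the hypothesized independent variable (Bedrooms) on the x-axis and the dependent variable (Price) on the y-axis. The data are made up, and cor() is a base R function not covered in these notes; it is included only to put a number on the trend visible in the plot.

```r
# Hypothetical stand-in for the course's real_estate data frame.
real_estate <- data.frame(
  Bedrooms = c(2, 3, 3, 4, 5),
  Price    = c(180, 210, 250, 320, 400)
)

# First argument goes on the x-axis, second on the y-axis.
plot(real_estate$Bedrooms, real_estate$Price,
     col  = "darkblue",
     main = "Price vs. Bedrooms",
     xlab = "Bedrooms",
     ylab = "Price")

# Correlation coefficient: positive when the dots trend upwards.
r <- cor(real_estate$Bedrooms, real_estate$Price)
print(r)
```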