Vis Analytics
Administrative and Exam Logistics
Extra copies of some course materials have been ordered; if you didn’t order one, they will be available at the bookstore when they arrive.
A video on Excel charts and graphs was planned for upload over the weekend; it will be posted eventually, not required immediately.
Personal update: the instructor’s mother went to the hospital Friday night and passed away Sunday; weekend work was affected.
Class impact: services will be held next week, by next Friday, but this will not affect the class schedule; class will meet Tuesday and Thursday as usual.
If a student feels they didn’t receive a response or still needs something, please reach out again; the instructor will respond. The goal is to ensure students have what they need, especially with an upcoming test.
Exam logistics:
The exam is worth 75 points total, with 38 questions at 2 points each, totaling 76 possible points.
The class aims for 900 out of 1000 points overall; this exam contributes to that total.
The exam will be completed in class on student computers; if you normally use an iPad, bring a laptop; if you don’t have a laptop, notify the instructor and they’ll help arrange one (library options available).
Seating will be temporarily rearranged for the exam, with some students relocated to the back row; the exam duration is approximately one hour.
The in-class exam environment is designed so most students can complete it without a lockdown browser; the exam uses an in-platform system with shuffled questions and algorithmic numbers to minimize cheating.
Excel will be used similarly to the homework; you should be comfortable with Excel tasks, as homework will be reflected on the test.
Exam preparation guidance:
Review all homework in MyLab (not the reading or try-it portions, but the actual homework assignments—one per chapter).
The test will reflect the same look and feel as homework; most questions come from previous homework, with some from the Week 1 quiz (history of data visualization).
Learn the vocabulary from chapters 1–4; the textbook contains blue-highlighted boxes with definitions; there is no prebuilt flashcard set in the publisher’s system, so students should create their own flashcards or use index cards.
Access the Gradebook and use the Review feature to revisit what you did on homework and read the questions.
General study approach:
Chapters 1–4 vocabulary and core concepts should be reviewed thoroughly; focus on definitions and how terms are used in context.
The instructor recommends active study strategies (e.g., index cards, reviewing blue boxes, and summarizing key ideas).
Core Concepts in this Unit
Two main branches of statistics:
Descriptive statistics (descriptive analytics, exploratory data analysis): describe data, summarize features, and tell clear stories about datasets.
Inferential statistics: take information from a sample to make inferences about a population.
Descriptive analytics focus: summary measures (mean, median, mode, variability, shape, and associations) used to describe current/historical data and set benchmarks; forecasting uses past data to project future outcomes.
Data storytelling and real-world context:
Stories in each chapter illustrate how descriptive statistics and analytics are used in business contexts.
Sam’s vignette (buying a used car) introduces concepts like central tendency, variability, shape, and outliers in a practical setting.
Foundational terms:
Central tendency: measures that describe the center of a dataset (mean, median, mode).
Variability (spread): how far data points are from each other (range, variance, standard deviation).
Shape of distribution: skewness and symmetry (left-skewed, right-skewed, symmetrical).
Associations: relationships between two quantitative variables (as seen in scatter plots).
Descriptive analytics (EDA): using statistics to describe and summarize data features, often with visualizations.
Real-world examples and contexts:
A large dataset example for car data (426,000 entries) illustrating the practical challenges of working with big data and how to summarize it.
Shopify case study shows Pareto-based insights (80/20) driving business decisions.
Descriptive Statistics and Descriptive Analytics
Descriptive statistics are basic data descriptions that typically fall into four categories:
Central tendency (mean, median, mode)
Variability (range, variance, standard deviation)
Shape of distribution (skewness, symmetry, outliers)
Associations (relationships between variables in a dataset)
Descriptive analytics (EDA) expands on these ideas by using statistical techniques to describe and summarize data features, often with visual metrics.
Forecasting context: describe past data to infer possible future trends; example from business: forecasting notebook sales by analyzing past 12–18 months of sales data.
Chapter vignette context: Sam buys a car using a dataset and concepts from descriptive statistics to understand the market and expected values.
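The four categories of descriptive statistics above can be sketched in a few lines of Python. The dataset below (prices and odometer readings, in thousands) is hypothetical, invented only to illustrate the calculations; it is not the chapter's used-car data.

```python
import math
import statistics

# Hypothetical used-car prices (thousands of dollars) -- illustrative only
prices = [12, 15, 15, 18, 19, 21, 22, 22, 26, 30]

# 1. Central tendency
mean = statistics.mean(prices)            # arithmetic average
median = statistics.median(prices)        # middle of the sorted data
modes = statistics.multimode(prices)      # most frequent value(s)

# 2. Variability
rng = max(prices) - min(prices)           # range = max - min
sample_var = statistics.variance(prices)  # sample variance (divides by n - 1)
sample_sd = statistics.stdev(prices)      # sample standard deviation

# 3. Shape: compare mean and median (mean > median hints at right skew)
skew_hint = "right-skewed" if mean > median else "left-skewed or symmetric"

# 4. Association: Pearson correlation between two quantitative variables
odometers = [90, 80, 85, 60, 55, 45, 40, 42, 25, 10]  # hypothetical, thousands of miles
mx, my = statistics.mean(prices), statistics.mean(odometers)
num = sum((x - mx) * (y - my) for x, y in zip(prices, odometers))
den = math.sqrt(sum((x - mx) ** 2 for x in prices)
                * sum((y - my) ** 2 for y in odometers))
r = num / den  # near -1: higher-mileage cars tend to have lower prices

print(mean, median, modes, rng, sample_var, round(sample_sd, 3), skew_hint, round(r, 3))
```

Each statistic maps directly onto one of the four categories named above, which is the same grouping used on the homework.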
Data and case studies in the chapter:
Used-car dataset (example): fields included ID, price, year, model, condition, cylinders, fuel, odometer, transmission, vehicle type, state, etc.; illustrates handling large data and the need for descriptive summaries before deeper analyses.
Shopify example (case study): categorizes product lines by revenue share (e.g., products A, B, C contributing 80%, 15%, 5% of revenue) and emphasizes the Pareto principle in practice.
Pareto principle (80/20 rule): a recurring theme in business analytics:
80% of results often come from 20% of causes.
The rule helps focus attention on the “vital few” causes, products, or customers that generate the majority of outcomes, with the remaining 80% having diminished returns.
Applications and caveats:
Product focus: 80% of revenue from 20% of products; 20% of products may occupy most warehouse space; decisions on inventory and product focus should consider this.
Customer focus: 20% of customers may generate 80% of revenue; consider targeting and servicing these key customers more intensively, while maintaining service for the rest.
HR and operational considerations: 80% of HR problems may originate from 20% of employees; similarly for customer complaints.
Important caveat: the exact percentages can vary (e.g., 75/25, 82/18); the principle is about prioritizing the vital few rather than ignoring the rest.
Practical implications for study and business analyses:
Pareto analyses help streamline decision-making, inventory management, and resource allocation.
The approach helps identify where to invest energy and which areas to trim to avoid waste.
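A simple Pareto analysis can be sketched as follows: rank items by revenue and walk down the list until the cumulative share reaches 80%. The product names and revenue figures are hypothetical, made up for illustration (note the "vital few" here turn out to be 30% of products, echoing the caveat that the exact split varies).

```python
# Hypothetical product revenues: a few products dominate the total
revenues = {
    "Product A": 400_000,
    "Product B": 250_000,
    "Product C": 150_000,
    "Product D": 60_000,
    "Product E": 50_000,
    "Product F": 40_000,
    "Product G": 30_000,
    "Product H": 10_000,
    "Product I": 6_000,
    "Product J": 4_000,
}

total = sum(revenues.values())
ranked = sorted(revenues.items(), key=lambda kv: kv[1], reverse=True)

# Walk down the ranked list until cumulative revenue reaches 80% of the total
cumulative = 0
vital_few = []
for product, revenue in ranked:
    cumulative += revenue
    vital_few.append(product)
    if cumulative >= 0.80 * total:
        break

share_of_products = len(vital_few) / len(revenues)
print(vital_few, f"{share_of_products:.0%} of products -> 80% of revenue")
```

The same loop works for customers, complaint sources, or inventory lines: only the dictionary changes.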
Central Tendency
Definition: central tendency describes the value around which data clusters; it represents the middle or typical value of a dataset.
The main measures:
Mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (i.e., the sum of all observations divided by the number of observations)
Median: the middle value of a sorted dataset; if n is even, the average of the two middle values.
Mode: the most frequently occurring value(s) in the dataset; can be unimodal or multimodal.
Trimmed means:
90% trimmed mean: remove the extreme values from both ends to focus on the central 90% of the data. (Note: the lecture example described removing 5 values from each end of 12 data points, which mixes up the percentages; the standard interpretation is to trim 5% of the observations from each end of a reasonably large dataset, leaving the central 90%.)
Example context (used car dataset and average price):
Reported mean price: $\bar{x} = 26.05$ (thousands)
Reported 90% trimmed mean (central 90% of the data): 21.04 (thousands)
Median price:
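The trimmed-mean idea can be sketched as a short Python function: sort, drop a fixed fraction from each end, then average what remains. The price list is hypothetical, chosen only to show how one extreme value inflates the plain mean while the trimmed mean resists it.

```python
def trimmed_mean(data, trim_fraction=0.05):
    """Mean of the central portion after dropping trim_fraction from each end.

    A 90% trimmed mean uses trim_fraction=0.05: the lowest 5% and highest 5%
    of the sorted values are removed before averaging.
    """
    ordered = sorted(data)
    k = int(len(ordered) * trim_fraction)  # values to drop from each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)

# Hypothetical prices (thousands): one extreme value inflates the plain mean
prices = [10, 12, 13, 14, 15, 15, 16, 17, 18, 19,
          20, 21, 22, 23, 24, 25, 26, 27, 28, 120]

plain = sum(prices) / len(prices)
trimmed = trimmed_mean(prices, 0.05)
print(round(plain, 2), round(trimmed, 2))  # trimmed mean sits below the plain mean
```

This mirrors the chapter's used-car numbers, where the 90% trimmed mean (21.04) fell well below the ordinary mean (26.05).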
Interpreting mean vs median and distribution shape:
If the mean and median are equal or very close, the distribution is roughly symmetric.
If the mean is greater than the median, the distribution is skewed to the right (positive skew).
If the mean is less than the median, the distribution is skewed to the left (negative skew).
Example intuition:
In the car example, a higher mean price relative to the median suggests some high-priced outliers pulling the average up, resulting in right skew.
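That intuition can be checked numerically. The prices below are hypothetical: mostly moderate values plus two luxury outliers, which pull the mean above the median and signal right skew.

```python
import statistics

# Hypothetical car prices (thousands): mostly moderate, two luxury outliers
prices = [12, 14, 15, 16, 17, 18, 18, 19, 20, 21, 22, 60, 75]

mean = statistics.mean(prices)
median = statistics.median(prices)

# Mean pulled above the median by the high-priced outliers => right (positive) skew
print(round(mean, 2), median, mean > median)
```

Removing the two outliers would bring the mean back close to the median, which is exactly why trimmed means are outlier-resistant.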
Variability and Shape of the Distribution
Variability (spread) describes how dispersed data are around the center.
Common measures:
Range: maximum value minus minimum value in the dataset.
Variance:
For a population: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
For a sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Standard deviation:
Population: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Sample: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Key interpretation:
Smaller standard deviation indicates data are less variable around the mean; greater standard deviation indicates more spread and variability.
Worked example context (odometer readings):
Range example: odometer readings from 0 to 218,000 miles, giving a range of about 218,000 miles (illustrative reading from the dialogue).
Variance example from the walkthrough: the squared deviations summed to 484; with n = 12, the sample variance is $s^2 = 484 / (12 - 1) = 44$.
Standard deviation: $s = \sqrt{44} \approx 6.63$ (in the units of the data, e.g., thousands of miles if the data were in thousands).
Relationship among mean, median, and mode for distribution shape:
When the data are roughly symmetrical, mean = median = mode.
If the distribution is skewed, mean ≠ median; the direction of skewness affects which statistic is pulled toward the tail.
The empirical rule (for approximately normal distributions):
About 68% of data within 1 standard deviation: $P(|X - \bar{x}| \le s) \approx 0.68$
About 95% within 2 standard deviations: $P(|X - \bar{x}| \le 2s) \approx 0.95$
About 99.7% within 3 standard deviations: $P(|X - \bar{x}| \le 3s) \approx 0.997$
Z-scores:
Definition: $z = \frac{x - \bar{x}}{s}$ (how many standard deviations an observation is from the mean).
Z-scores help compare data points across different scales.
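The cross-scale comparison is the whole point of z-scores, and a small sketch makes it concrete. The two score lists below are hypothetical, built so that 85 in "class A" and 450 in "class B" sit in the same relative position despite the very different scales.

```python
import statistics

# Hypothetical exam scores on two different scales
scores_a = [70, 75, 80, 85, 90]        # class A, out of 100
scores_b = [300, 350, 400, 450, 500]   # class B, out of 500

def z_score(x, data):
    """How many sample standard deviations x lies from the mean of data."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

# A student scoring 85 in class A vs 450 in class B: equally unusual?
za = z_score(85, scores_a)
zb = z_score(450, scores_b)
print(round(za, 3), round(zb, 3))  # identical z-scores despite different scales
```

Because both observations land the same number of standard deviations above their respective means, the two performances are directly comparable.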
Practical Calculations: Hand and Excel Approaches
Hand calculations (illustrative steps used in the lecture):
Step 1: Arrange data in ascending order to reduce mistakes and make it easy to locate the min, max, range, and median.
Step 2: Compute the mean: $\bar{x} = \frac{\sum x_i}{n}$. In the example, the sum was 228 with n = 12, giving $\bar{x} = 228 / 12 = 19$ (units consistent with the data, e.g., thousands of dollars).
Step 3: Compute the median for n = 12 (even): the middle two values are the 6th and 7th values; the median is the average of these two. In the example, the median turned out to be 19 (same as the mean in this case).
Step 4: Determine the mode(s): the value(s) that occur most frequently; in the example, the modes were 15 and 22 (bimodal).
Step 5: Compute the range: in the example, range = 30 - 10 = 20.
Step 6: Compute the variance (sample, per the lecture):
Compute each deviation $(x_i - \bar{x})$, square them, and sum: $\sum (x_i - \bar{x})^2 = 484$.
Then divide by $(n - 1)$: $s^2 = 484 / 11 = 44$.
Step 7: Compute the standard deviation: $s = \sqrt{44} \approx 6.63$.
Step 8: Interpret the spread around the mean using the standard deviation and, if helpful, the 68-95-99.7 rule.
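The arithmetic in Steps 2, 6, and 7 can be reproduced from the summary values the lecture gave (sum = 228, n = 12, sum of squared deviations = 484), without needing the individual data points:

```python
import math

# Summary values from the worked example in the lecture
n = 12
total = 228        # sum of the 12 observations
sum_sq_dev = 484   # sum of squared deviations from the mean

mean = total / n                          # 228 / 12 = 19
sample_variance = sum_sq_dev / (n - 1)    # 484 / 11 = 44
sample_sd = math.sqrt(sample_variance)    # about 6.63

print(mean, sample_variance, round(sample_sd, 2))
```

Dividing by n - 1 rather than n is what makes this the sample variance; the population version would give 484 / 12 instead.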
Excel implementation (as demonstrated):
Mean: =AVERAGE(range)
Median: =MEDIAN(range)
Mode (all modes): =MODE.MULT(range) (Excel will spill multiple values; ensure you have space to display all modes)
Range: =MAX(range) - MIN(range)
Variance (sample): =VAR.S(range)
Standard deviation (sample): =STDEV.S(range)
Notes on Excel nuances (as discussed):
Excel has multiple variants of mode (MODE.SNGL vs MODE.MULT); use MODE.MULT to capture multiple modes.
When using MODE.MULT, Excel may spill results into adjacent cells; plan the layout accordingly.
If you see an “overflow” or spill issue with the Mode function, ensure the destination range has enough empty cells for all mode values.
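Python's standard library draws the same single-mode vs. all-modes distinction as Excel's MODE.SNGL vs. MODE.MULT, which makes it a handy way to check your spreadsheet results. The data list below is hypothetical, built to be bimodal:

```python
import statistics

# Hypothetical bimodal data, analogous to Excel's MODE.MULT spilling two values
values = [10, 14, 15, 15, 17, 18, 20, 21, 22, 22, 24, 30]

one_mode = statistics.mode(values)        # like MODE.SNGL: a single mode
all_modes = statistics.multimode(values)  # like MODE.MULT: every tied mode

print(one_mode, all_modes)
```

Unlike Excel, `multimode` returns an ordinary list, so there is no spill range to worry about; but the underlying caution is the same: a single-mode function silently hides the other tied modes.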
Exam Guidance and Preparation Tips
Exam format and expectations:
75 points total; 38 questions at 2 points each (total = 76, but the official total is stated as 75 points in the session); the discrepancy is likely due to specific scoring rules or a rounding convention; treat the 75-point total as the stated exam score.
Most questions will resemble homework problems; some questions will include content from the Week 1 quiz (history of data visualization).
The exam will be taken in-class on a computer; bring a laptop if you don’t typically use one (library laptops available if needed).
No notes allowed; one attempt; no lockdown browser required for this exam; questions will be randomized with algorithmic components to prevent cheating; Excel may be used for data-entry/computations, as with homework.
Study resources and strategy:
Focus your review on homework assignments in MyLab (one per chapter) as the primary source for exam content.
Review vocabulary terms across chapters 1–4; use blue-highlighted boxes in the textbook for quick reference to definitions.
Use the Gradebook review feature to reread questions and understand the expected solutions.
Practice flashcards or index cards for vocabulary since the textbook’s built-in flashcards are not pre-populated in this course platform.
Key content areas to master for the unit:
Descriptive statistics and descriptive analytics concepts (central tendency, variability, shape, associations).
Distinctions between descriptive vs. inferential statistics; why inferences are drawn from samples to populations.
Central tendency measures (mean, median, mode) and the construction/use of a trimmed mean (e.g., 90% trimmed mean) and its purpose (outlier resistance).
Variability measures (range, variance, standard deviation) and their interpretation in real data contexts.
Shape and distribution concepts (skewness, symmetry) and their impact on mean/median interpretations.
Z-scores and the empirical rule for normal distributions (68-95-99.7).
Practical interpretation of the Pareto principle (80/20) and its business implications (revenue concentration, inventory decisions, customer focus, and HR considerations).
Real-world connections and examples to reinforce concepts:
Shopify case study demonstrates how a platform’s analytics might show that 80% of revenue comes from 20% of products; helps prioritize product development and marketing efforts.
Pareto principle applied to customers, products, inventory, and HR problems; illustrates how distributional insights influence resource allocation and strategic decisions.
The instructor highlights a use-case for a large dataset (426,000 observations) to illustrate the practical challenges of descriptive analytics in big data contexts.
Quick Reference: Formulas and Key Values (LaTeX)
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (the formula is the same for a sample or a population, but the denominators differ in the variance formulas below).
Median: value at the middle of an ordered list (average of the two middle values if n is even).
Mode: most frequent value(s) in the data.
Range: $\max(x) - \min(x)$
Variance and Standard Deviation:
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Z-score: $z = \frac{x - \bar{x}}{s}$
Empirical rule (approximate for normal distributions):
Within 1 standard deviation: ~68%
Within 2 standard deviations: ~95%
Within 3 standard deviations: ~99.7%
Trimmed mean (conceptual): remove a fixed percentage of the extreme values from both ends and compute the mean on the remaining data (e.g., central 90% of data).
Final Takeaways
Descriptive statistics and descriptive analytics provide the foundational tools for summarizing data and telling stories with numbers.
The Pareto principle is a powerful heuristic for prioritization in business analytics and decision-making.
When interpreting data, compare mean and median to infer distribution shape and consider outliers via trimmed means or standard deviation.
Excel is a practical tool for doing these calculations, but understanding the underlying formulas is essential for proper interpretation and flexibility in analysis.
For exams, expect a blend of hand-calculation practice and Excel-based problems; focus on one-per-chapter homework as the primary guide to future questions.