Vis Analytics

Administrative and Exam Logistics

  • Extra copies of some course materials were sent to the bookstore; if you didn’t order one, copies will be available there when they arrive.

  • A video on Excel charts and graphs was planned for upload over the weekend; it will be posted eventually, not required immediately.

  • Personal update: the instructor’s mother went to the hospital Friday night and passed away Sunday; weekend work was affected.

  • Class impact: services will be held next week, by Friday, but this will not affect the class schedule; class will meet Tuesday and Thursday as usual.

  • If a student feels they didn’t receive a response or still needs something, please reach out again; the instructor will respond. The goal is to ensure students have what they need, especially with a test coming up.

  • Exam logistics:

    • The exam is stated as worth 75 points total; it has 38 questions at 2 points each, which works out to 76 possible points.

    • The class aims for 900 out of 1000 points overall; this exam contributes to that total.

    • The exam will be completed in class on student computers; if you normally use an iPad, bring a laptop; if you don’t have a laptop, notify the instructor and they’ll help arrange one (library options available).

    • Some students will be temporarily relocated to the back row during the exam; the exam takes approximately one hour.

    • The in-class exam environment is designed so most students can complete it without a lockdown browser; the exam uses an in-platform system with shuffled questions and algorithmic numbers to minimize cheating.

    • Excel will be used similarly to the homework; you should be comfortable with Excel tasks, as homework will be reflected on the test.

  • Exam preparation guidance:

    • Review all homework in MyLab (not the reading or try-it portions, but the actual homework assignments—one per chapter).

    • The test will reflect the same look and feel as homework; most questions come from previous homework, with some from the Week 1 quiz (history of data visualization).

    • Learn the vocabulary from chapters 1–4; the textbook contains blue-highlighted boxes with definitions; there is no prebuilt flashcard set in the publisher’s system, so students should create their own flashcards or use index cards.

    • Access the Gradebook and use the Review feature to revisit what you did on homework and read the questions.

  • General study approach:

    • Chapters 1–4 vocabulary and core concepts should be reviewed thoroughly; focus on definitions and how terms are used in context.

    • The instructor recommends active study strategies (e.g., index cards, reviewing blue boxes, and summarizing key ideas).

Core Concepts in this Unit

  • Two main branches of statistics:

    • Descriptive statistics (descriptive analytics, exploratory data analysis): describe data, summarize features, and tell clear stories about datasets.

    • Inferential statistics: take information from a sample to make inferences about a population.

  • Descriptive analytics focus: measures such as mean, median, mode, variability, shape, and associations, used to describe current/historical data and set benchmarks; forecasting uses past data to project future outcomes.

  • Data storytelling and real-world context:

    • Stories in each chapter illustrate how descriptive statistics and analytics are used in business contexts.

    • Sam’s vignette (buying a used car) introduces concepts like central tendency, variability, shape, and outliers in a practical setting.

  • Foundational terms:

    • Central tendency: measures that describe the center of a dataset (mean, median, mode).

    • Variability (spread): how far data points are from each other (range, variance, standard deviation).

    • Shape of distribution: skewness and symmetry (left-skewed, right-skewed, symmetrical).

    • Associations: relationships between two quantitative variables (as seen in scatter plots).

    • Descriptive analytics (EDA): using statistics to describe and summarize data features, often with visualizations.

  • Real-world examples and contexts:

    • A large dataset example for car data (426,000 entries) illustrating the practical challenges of working with big data and how to summarize it.

    • Shopify case study shows Pareto-based insights (80/20) driving business decisions.

Descriptive Statistics and Descriptive Analytics

  • Descriptive statistics are basic data descriptions that typically fall into four categories:

    • Central tendency (mean, median, mode)

    • Variability (range, variance, standard deviation)

    • Shape of distribution (skewness, symmetry, outliers)

    • Associations (relationships between variables in a dataset)

  • Descriptive analytics (EDA) expands on these ideas by using statistical techniques to describe and summarize data features, often with visual metrics.

  • Forecasting context: describe past data to infer possible future trends; example from business: forecasting notebook sales by analyzing past 12–18 months of sales data.

  • Chapter vignette context: Sam buys a car using a dataset and concepts from descriptive statistics to understand the market and expected values.

  • Data and case studies in the chapter:

    • Used-car dataset (example): fields included ID, price, year, model, condition, cylinders, fuel, odometer, transmission, vehicle type, state, etc.; illustrates handling large data and the need for descriptive summaries before deeper analyses.

    • Shopify example (case study): categorizes product lines by revenue share (e.g., products A, B, C contributing 80%, 15%, 5% of revenue) and emphasizes the Pareto principle in practice.

  • Pareto principle (80/20 rule): a recurring theme in business analytics:

    • 80% of results often come from 20% of causes.

    • The rule helps focus attention on the “vital few” causes, products, or customers that generate the majority of outcomes, with the remaining 80% having diminished returns.

    • Applications and caveats:

    • Product focus: 80% of revenue from 20% of products; 20% of products may occupy most warehouse space; decisions on inventory and product focus should consider this.

    • Customer focus: 20% of customers may generate 80% of revenue; consider targeting and servicing these key customers more intensively, while maintaining service for the rest.

    • HR and operational considerations: 80% of HR problems may originate from 20% of employees; similarly for customer complaints.

    • Important caveat: the exact percentages can vary (e.g., 75/25, 82/18); the principle is about prioritizing the vital few rather than ignoring the rest.

  • Practical implications for study and business analyses:

    • Pareto analyses help streamline decision-making, inventory management, and resource allocation.

    • The approach helps identify where to invest energy and which areas to trim to avoid waste.
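The Pareto logic above can be sketched in code: rank items by revenue, accumulate their shares, and stop once the running total reaches the threshold. This is a minimal illustration; the product names and revenue figures are invented, mirroring the 80/15/5 split from the Shopify example.

```python
# Hypothetical sketch of a Pareto (80/20) analysis: rank items by revenue
# and find the "vital few" that account for roughly 80% of the total.
# Product names and revenue figures are invented for illustration.

def vital_few(revenue_by_item, threshold=0.80):
    """Return the smallest set of top items whose cumulative revenue
    share reaches the threshold (default 80%), plus that share."""
    total = sum(revenue_by_item.values())
    ranked = sorted(revenue_by_item.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for item, revenue in ranked:
        selected.append(item)
        cumulative += revenue / total
        if cumulative >= threshold:
            break
    return selected, cumulative

revenue = {"A": 80_000, "B": 15_000, "C": 5_000}  # mirrors the 80/15/5 split
items, share = vital_few(revenue)
print(items, round(share, 2))  # ['A'] 0.8 — product A alone hits the threshold
```

Note that the threshold is a tunable prioritization knob, matching the caveat above: the real split might be 75/25 or 82/18 rather than exactly 80/20.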

Central Tendency

  • Definition: central tendency describes the value around which data clusters; it represents the middle or typical value of a dataset.

  • The main measures:

    • Mean (average):

    • Formula: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i (i.e., the sum of all observations divided by the number of observations)

    • Median: the middle value of a sorted dataset; if even n, the average of the two middle values.

    • Mode: the most frequently occurring value(s) in the dataset; can be unimodal or multimodal.

  • Trimmed means:

    • 90% trimmed mean concept (as discussed in class): remove the extreme values from both ends to focus on the central 90% of the data. (Note: the lecture’s small example, described as removing 5 values from each end of 12 data points, appears to mix the 90% trim with a 10% edge case; the standard interpretation is to trim 5% of the observations from each end of a reasonably large dataset, leaving the central 90%.)
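The central-tendency measures above can be computed with the Python standard library; a trimmed mean needs only a few lines. The data values below are invented for illustration, with one deliberate outlier.

```python
# Minimal sketch of central-tendency measures using only the standard
# library. The data values are invented for illustration.
from statistics import mean, median, multimode

def trimmed_mean(data, proportion=0.05):
    """Mean of the central data after dropping `proportion` of the
    observations from each end (0.05 per end keeps the central 90%)."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # observations dropped per end
    trimmed = ordered[k:len(ordered) - k] if k else ordered
    return mean(trimmed)

data = [10, 12, 15, 15, 18, 19, 20, 22, 22, 25, 28, 95]  # 95 is an outlier
print(mean(data))                # pulled upward by the outlier
print(median(data))              # resistant to the outlier
print(multimode(data))           # all modes, like Excel's MODE.MULT
print(trimmed_mean(data, 0.10))  # here, drops one value from each end
```

The comparison illustrates why a trimmed mean is outlier-resistant: dropping the extremes moves the average back toward the median.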

  • Example context (used car dataset and average price):

    • Reported mean price: \bar{x} = 26.05 (thousands)

    • Reported 90% trimmed mean (central 90%): \approx 21.04 (thousands)

    • Median price: 20{,}000 dollars (i.e., 20 on the thousands scale used for the mean)

  • Interpreting mean vs median and distribution shape:

    • If the mean and median are equal or very close, the distribution is roughly symmetric.

    • If the mean is greater than the median, the distribution is skewed to the right (positive skew).

    • If the mean is less than the median, the distribution is skewed to the left (negative skew).

  • Example intuition:

    • In the car example, a higher mean price relative to the median suggests some high-priced outliers pulling the average up, resulting in right skew.

Variability and Shape of the Distribution

  • Variability (spread) describes how dispersed data are around the center.

  • Common measures:

    • Range: R = \max(x_i) - \min(x_i)

    • Variance:

    • For a population: \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}

    • For a sample: s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

    • Standard deviation:

    • Population: \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}

    • Sample: s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
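The population vs. sample distinction above (divide by n vs. n − 1) maps directly onto the standard library's paired functions. A minimal sketch, with invented data chosen so the population variance comes out to a whole number:

```python
# Sketch of population vs. sample variance and standard deviation,
# matching the formulas above (divide by n vs. n - 1). Data invented.
from statistics import pvariance, pstdev, variance, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]

r = max(data) - min(data)   # range = 9 - 2 = 7
pop_var = pvariance(data)   # divides by n: 32 / 8 = 4
samp_var = variance(data)   # divides by n - 1: 32 / 7
print(r, pop_var, pstdev(data))

# The library results agree with the hand formula:
n = len(data)
m = sum(data) / n           # mean = 40 / 8 = 5
assert sum((x - m) ** 2 for x in data) / (n - 1) == samp_var
```

The sample versions (`variance`, `stdev`) are the ones that correspond to Excel's VAR.S and STDEV.S used later in these notes.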

  • Key interpretation:

    • Smaller standard deviation indicates data are less variable around the mean; greater standard deviation indicates more spread and variability.

  • Worked example context (odometer readings):

    • Range example: odometer readings from 0 to 218,000 miles give a range of about 218,000 miles (illustrative reading from the lecture).

    • Variance example from the walkthrough: the squared deviations summed to 484; with n = 12, the sample variance is s^2 = \frac{484}{11} = 44.

    • Standard deviation: s = \sqrt{44} \approx 6.63 (in the units of the data, e.g., thousands of miles if the data were in thousands).

  • Relationship among mean, median, and mode for distribution shape:

    • When the data are roughly symmetrical, mean ≈ median ≈ mode.

    • If the distribution is skewed, mean ≠ median; the direction of skewness affects which statistic is pulled toward the tail.

  • The empirical rule (for approximately normal distributions):

    • About 68% of data within 1 standard deviation: P(|X - \bar{x}| \le s) \approx 0.68

    • About 95% within 2 standard deviations: P(|X - \bar{x}| \le 2s) \approx 0.95

    • About 99.7% within 3 standard deviations: P(|X - \bar{x}| \le 3s) \approx 0.997

  • Z-scores:

    • Definition: z = \frac{x - \bar{x}}{s} (how many standard deviations an observation lies from the mean).

    • Z-scores help compare data points across different scales.
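The cross-scale comparison can be shown with a short sketch. The means and standard deviations below are hypothetical summary figures invented for illustration (they are not from the lecture's dataset):

```python
# Sketch of the z-score formula; it lets you compare observations
# measured on different scales. Summary figures are hypothetical.
def z(x, m, s):
    """Standard score: how many standard deviations x lies from mean m."""
    return (x - m) / s

# Which is more unusual: a $35,000 asking price or a 150,000-mile odometer?
price_z = z(35_000, 26_050, 6_630)    # hypothetical market mean / stdev
miles_z = z(150_000, 90_000, 40_000)  # hypothetical market mean / stdev
print(round(price_z, 2), round(miles_z, 2))  # 1.35 1.5
```

Because both results are in standard-deviation units, they are directly comparable: here the odometer reading (z = 1.5) is farther from its market's mean than the price (z ≈ 1.35).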

Practical Calculations: Hand and Excel Approaches

  • Hand calculations (illustrative steps used in the lecture):

    • Step 1: Arrange data in ascending order to reduce mistakes and make it easy to locate the min, max, range, and median.

    • Step 2: Compute the mean: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. In the example, the sum was 228 with n = 12, giving \bar{x} = \frac{228}{12} = 19 (units consistent with the data, e.g., thousands of dollars).

    • Step 3: Compute the median for n = 12 (even): the middle two values are the 6th and 7th values; the median is the average of these two. In the example, the median turned out to be 19 (same as the mean in this case).

    • Step 4: Determine the mode(s): the value(s) that occur most frequently; in the example, the modes were 15 and 22 (bimodal).

    • Step 5: Compute the range: R = \max(x_i) - \min(x_i); in the example, range = 30 - 10 = 20.

    • Step 6: Compute the variance (sample, per the lecture):

    • Compute each deviation x_i - \bar{x}, square each, and sum them: \sum (x_i - \bar{x})^2 = 484.

    • Then divide by n - 1: s^2 = \frac{484}{11} = 44.

    • Step 7: Compute the standard deviation: s = \sqrt{44} \approx 6.63.

    • Step 8: Interpret the spread around the mean using the standard deviation and, if helpful, the 68-95-99.7 rule.
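The steps above can be verified end to end in code. The lecture's actual 12 values are not recorded in these notes, so the dataset below is one invented set that is consistent with every reported figure (sum 228, mean 19, median 19, modes 15 and 22, range 20, squared deviations summing to 484, s² = 44, s ≈ 6.63):

```python
# One possible 12-value dataset consistent with the worked example's
# reported figures (the lecture's actual values are not in the notes).
from statistics import mean, median, multimode, variance, stdev
from math import isclose, sqrt

data = sorted([10, 11, 12, 15, 15, 18, 20, 22, 22, 25, 28, 30])  # Step 1

assert sum(data) == 228 and mean(data) == 19     # Step 2: 228 / 12 = 19
assert median(data) == 19                        # Step 3: avg of 18 and 20
assert multimode(data) == [15, 22]               # Step 4: bimodal
assert max(data) - min(data) == 20               # Step 5: 30 - 10
assert sum((x - 19) ** 2 for x in data) == 484   # Step 6: squared deviations
assert variance(data) == 484 / 11 == 44          # Step 6: divide by n - 1
assert isclose(stdev(data), sqrt(44))            # Step 7: about 6.63
print("all reported figures check out")
```

Working through such a dataset by hand and then confirming each step in software is a good way to catch arithmetic slips before the exam.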

  • Excel implementation (as demonstrated):

    • Mean: =AVERAGE(range)

    • Median: =MEDIAN(range)

    • Mode (all modes): =MODE.MULT(range) (Excel will spill multiple values; leave enough room to display all modes)

    • Range: =MAX(range) - MIN(range)

    • Variance (sample): =VAR.S(range)

    • Standard deviation (sample): =STDEV.S(range)

  • Notes on Excel nuances (as discussed):

    • Excel has multiple variants of mode (MODE.SNGL vs MODE.MULT); use MODE.MULT to capture multiple modes.

    • When using MODE.MULT, Excel may spill results into adjacent cells; plan the layout accordingly.

    • If you see an “overflow” or spill issue with the Mode function, ensure the destination range has enough empty cells for all mode values.
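For reference, the MODE.SNGL vs. MODE.MULT distinction has a direct analog in Python's standard library, without any spill-range concerns. The data values are invented for illustration:

```python
# statistics.mode returns a single value (like MODE.SNGL, Python 3.8+:
# the first mode encountered); statistics.multimode returns every tied
# value as a list (like MODE.MULT). Data invented for illustration.
from statistics import mode, multimode

data = [15, 22, 15, 22, 18, 30]
print(mode(data))       # 15  (first mode encountered)
print(multimode(data))  # [15, 22]  (all modes)
```

A list result sidesteps the spill problem entirely: however many modes exist, they all come back in one object.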

Exam Guidance and Preparation Tips

  • Exam format and expectations:

    • 75 points total was stated in the session, though 38 questions at 2 points each works out to 76; the discrepancy is likely due to specific scoring rules or a rounding convention, so treat 75 as the official exam total.

    • Most questions will resemble homework problems; some questions will include content from the Week 1 quiz (history of data visualization).

    • The exam will be taken in-class on a computer; bring a laptop if you don’t typically use one (library laptops available if needed).

    • No notes allowed; one attempt; no lockdown browser required for this exam; questions will be randomized with algorithmic components to prevent cheating; Excel may be used for data-entry/computations, as with homework.

  • Study resources and strategy:

    • Focus your review on homework assignments in MyLab (one per chapter) as the primary source for exam content.

    • Review vocabulary terms across chapters 1–4; use blue-highlighted boxes in the textbook for quick reference to definitions.

    • Use the Gradebook review feature to reread questions and understand the expected solutions.

    • Practice flashcards or index cards for vocabulary since the textbook’s built-in flashcards are not pre-populated in this course platform.

  • Key content areas to master for the unit:

    • Descriptive statistics and descriptive analytics concepts (central tendency, variability, shape, associations).

    • Distinctions between descriptive vs. inferential statistics; why inferences are drawn from samples to populations.

    • Central tendency measures (mean, median, mode) and the construction/use of a trimmed mean (e.g., 90% trimmed mean) and its purpose (outlier resistance).

    • Variability measures (range, variance, standard deviation) and their interpretation in real data contexts.

    • Shape and distribution concepts (skewness, symmetry) and their impact on mean/median interpretations.

    • Z-scores and the empirical rule for normal distributions (68-95-99.7).

    • Practical interpretation of the Pareto principle (80/20) and its business implications (revenue concentration, inventory decisions, customer focus, and HR considerations).

  • Real-world connections and examples to reinforce concepts:

    • Shopify case study demonstrates how a platform’s analytics might show that 80% of revenue comes from 20% of products; helps prioritize product development and marketing efforts.

    • Pareto principle applied to customers, products, inventory, and HR problems; illustrates how distributional insights influence resource allocation and strategic decisions.

    • The instructor highlights a use-case for a large dataset (426,000 observations) to illustrate the practical challenges of descriptive analytics in big data contexts.

Quick Reference: Formulas and Key Values (LaTeX)

  • Mean (the formula is the same for a sample or a population; only the variance denominators differ):

    • \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

  • Median: value at the middle of an ordered list (average of the two middle values if n is even).

  • Mode: most frequent value(s) in the data.

  • Range:

    • R = \max_i x_i - \min_i x_i

  • Variance and Standard Deviation:

    • Population variance: \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2

    • Sample variance: s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

    • Population standard deviation: \sigma = \sqrt{\sigma^2}

    • Sample standard deviation: s = \sqrt{s^2}

  • Z-score:

    • z = \frac{x - \bar{x}}{s}

  • Empirical rule (approximate for normal distributions):

    • Within 1 standard deviation: ~68%

    • Within 2 standard deviations: ~95%

    • Within 3 standard deviations: ~99.7%

  • Trimmed mean (conceptual): remove a fixed percentage of the extreme values from both ends and compute the mean on the remaining data (e.g., central 90% of data).

Final Takeaways

  • Descriptive statistics and descriptive analytics provide the foundational tools for summarizing data and telling stories with numbers.

  • The Pareto principle is a powerful heuristic for prioritization in business analytics and decision-making.

  • When interpreting data, compare mean and median to infer distribution shape and consider outliers via trimmed means or standard deviation.

  • Excel is a practical tool for doing these calculations, but understanding the underlying formulas is essential for proper interpretation and flexibility in analysis.

  • For exams, expect a blend of hand-calculation practice and Excel-based problems; focus on one-per-chapter homework as the primary guide to future questions.