Statistics and Excel Review — Lecture Notes

Plan for Today

  • Q&A and review based on feedback surveys from last week
  • Review of what we learned last week in Excel
  • Learn some new Excel skills
  • Get you ready for the new semester

Meet Tyler: Your AI TA

  • Tyler is introduced as the course’s AI assistant (TA) and has worked with the instructor for over a year.
  • Tyler is a resource alongside the instructor for questions about homework or general course navigation.
  • Office hours are by appointment and flexible; you can email with questions.
  • The instructor and Tyler are available as resources; you can reach out to either one for help.
  • The instructor emphasizes responsiveness to feedback surveys and personal replies to student feedback.

Office Hours, Communication, and Feedback

  • Office hours are by appointment; email questions are encouraged.
  • Feedback surveys: the instructor reads all comments and replies; students should not just report a numeric score but engage with the feedback.
  • If you’re struggling, contact either the instructor or Tyler early rather than waiting.
  • Clarification: questions about homework should be directed to office hours or email, not to the lecture feedback section.

Wednesdays: Structure and Strategy

  • Wednesdays are busy: lecture to watch, participation quiz, lecture feedback survey, and a homework due.
  • Advice: avoid procrastinating on homework; weekend or Monday time blocks are best for completion.
  • If you don’t have questions about homework, you can still submit it on Wednesday; if you have questions, use Monday or Tuesday office hours.
  • The goal is to avoid ending up with a heavy workload on Wednesday night.
  • Workshops are recorded and available on Canvas; you can review them later.
  • The instructor notes that some formatting tips (e.g., coloring, bolding) may be covered in workshops; students should not rely on Google for everything—strong guidance is provided.

Workshops and Recordings

  • Workshops are recorded and accessible on Canvas.
  • Most homework content is covered in some form in the workshops; there are hints and formatting options discussed.
  • If you miss class, you can review the workshop recordings to fill in gaps.

Questions, Questions, Questions

  • The instructor encourages asking questions as a sign of learning, bravery, and academic citizenship.
  • Asking questions helps classmates too because it provides answers or clarifications that others may benefit from.
  • Students should feel free to ask about any topic, not just the skew or bins.

Skew, Multimodality, and Distribution Shapes

  • Skew: not symmetric; defined by the tail direction, not where most data lies.
  • Positive skew (skewed to the right): tail extends to the positive direction.
  • Negative skew (skewed to the left): tail extends to the negative direction.
  • A quick classroom trick (a playful stat-impression): narrate yourself as the histogram and point to the tail; practice makes the concept intuitive.
  • When asked to imagine a distribution’s shape, you should describe modality (unimodal, bimodal) and skew (positive, negative, or symmetric).
  • A Kilroy’s example: describe a plausible distribution of drinks per night; you can propose unimodal with a peak around a central value and tails on both sides, but the exact shape may vary depending on reasoning.
  • The instructor accepts reasonable, well-explained descriptions even if they don’t match a single “correct” shape.
  • A Harvard GPA example: normal GPA range is 0–4; discussion of ceiling (near 4) and floor (near 0) effects; expected data pile near the top (ceiling) with tails downward and upward depending on student outcomes.
  • Ceiling effect: data pile up at the upper limit; floor effect: data pile up at the lower limit.
  • If both floor and ceiling effects occur, the distribution could be bimodal and potentially symmetric depending on the relative pile sizes.
  • Practical reflections: even with fixed bounds, data can be skewed without a hard floor/ceiling effect if tails exist.
  • Real-world relevance: understanding distribution shape helps interpret means, medians, modes, and informs critical evaluation of statistics in real-world claims.
  • Answering shape questions well requires articulating a plausible data-generating process and using proper terminology.
  • Questions about data shapes are tied to later topics such as mean, median, and mode, and how they respond to distribution shape.
  • The value of imagining shapes: helps you be a critical consumer of statistics and recognize misleading presentations.
  • Practical note: small samples can misrepresent the true shape and affect early course grades; be mindful of sample size when interpreting grades or feedback.

Bins, Histograms, and Data Binning Rules

  • Bins are the intervals that histograms use to group data.
  • Three practical rules for bins discussed:
    • Rule 1: Start with about 10 bins to get a good sense of shape; 10 is a starting point, not a hard rule.
    • Rule 2: Choose a bin size that is a nice round number and makes the range coverable in roughly 10 bins (e.g., range 0–5 with bin size 0.5 yields 10 bins).
    • Rule 3: Start on multiples of the chosen bin size (e.g., 0, 0.5, 1.0, …). Avoid starting on a non-round lower bound like 0.03.
  • It’s acceptable for some bins to be empty.
  • There are cases where a non-10-bin setup is better for answering specific questions; the general guideline is to use around 10 bins for shape exploration.
  • Alternative bin counts (8, 12, etc.) can be used depending on the question at hand.
  • In practice, you should select a bin size that yields a clean, round-numbered bin range and ensure the lower bound starts at a round number.
  • Example discussion: Harvard GPAs with a range 0–4 would suggest 10 bins with a width of 0.4 if strictly following 0–4, but the instructor recommends a round-number bin width (e.g., 0.5) and starting at a round lower bound to get ~10 bins.
  • Notes on interpretation: some bins may be very sparse or empty; the key is to capture shape rather than perfection.
  • Special cases: some questions may benefit from non-standard bin counts or bin sizes depending on the distribution and the precise question asked.
  • Quick recap: 10-bin starting point, round bin width, lower bound on a round number; adjust as needed for clarity and the question.

Floor and Ceiling Effects and Skew

  • Floor effect: many observations pile up at the minimum possible value (lower bound).
  • Ceiling effect: many observations pile up at the maximum possible value (upper bound).
  • Both floor and ceiling effects can produce skew; however, skew can also occur without explicit floor/ceiling effects if there is a long tail in one direction.
  • You can have distributions with fixed bounds that are still skewed if most data lie away from the bounds and tails extend in one direction.
  • Bowling scores example: bowling has a hard upper limit (300) but it’s not common to hit that, so skew isn’t necessarily due to a ceiling in practice.
  • If both a floor and ceiling are heavily populated, the distribution could become bimodal; depending on which pile is larger and the tail directions, the skew could be negative or positive.
  • Practical exam tip: when asked to describe shape, you should mention modality (unimodal/bimodal) and skew (symmetric, left-skewed, right-skewed).

Scale vs Ordinal Data and Ranking

  • Ordinal data: conveys order but not equal spacing between ranks (e.g., rank 1, rank 2, rank 3 without implying equal distances).
  • Scale data (interval/ratio): distances between values are meaningful and uniform (e.g., lengths, GPA scale where differences are consistent).
  • Example discussion from Tyler: a grade rank column is ordinal if only the order is known; if the distances between ranks are meaningful and uniform, it could be scaled.
  • Distinguishing features:
    • Ordinal: order matters; distances unknown or not equal.
    • Scale: distance between values is meaningful and constant.
  • Practical takeaway: when you need to compare central tendency or variability, know whether your data are ordinal or scaled, because that changes which statistical summaries are appropriate.
  • The instructor emphasizes asking questions about levels of measurement as a core skill for data interpretation.

Small Sample Sizes, Grades, and Data Interpretation

  • Acknowledge that small samples can be misleading for interpreting distributions or making inferences.
  • Canvas grade visibility: early in the semester, grade displays may be unstable due to small sample sizes; expect the grade to be “hidden” or not fully representative until more data are collected.
  • The instructor plans to unlock grade visibility later in the term once data become more representative.
  • Advice: maximize performance across quizzes, homework, and participation to avoid misinterpretation due to small-sample noise.
  • Excel and grading practice: the course teaches how to download and compute grades in Excel as a learning exercise.

Excel Review: Getting Started with the Movie Data Workbook

  • Data file: a workbook with multiple movie-related columns used for five weeks of practice.
  • Initial columns include:
    • Movie title
    • MPAA rating (e.g., G, PG, PG-13, R)
    • Ratings from Rotten Tomatoes critics (RT critic), Rotten Tomatoes fans (RT fan), IMDB, and Metacritic
    • A synthetic column called Rick rate for demonstration purposes
    • Box office revenue
    • A set of “flags” (binary columns) indicating movie type (action, animation, family, drama, etc.)
  • For today’s tasks: delete all the flag columns and focus on the core data (title, rating, critic/fan/IMDb/Metacritic scores, box office).
  • Formatting steps demonstrated:
    • Bold headers for columns (Row 1) using Ctrl+B or the Bold option
    • Auto-adjust column width to fit contents
    • Center-align rating columns for readability
    • Sort by year released (oldest to newest) with data-wide selection to keep rows intact
    • Use Custom Sort to ensure the entire row is sorted by the chosen column (e.g., Year)
    • If Excel misidentifies headers, check the "My data has headers" option to restore header rows and proper sorting
  • Saving and exporting:
    • Save progress frequently (File > Save)
    • Save as PDF with consideration for readability: prefer "Fit to paper width" so that all columns fit on one or a few pages, avoiding overly small text
    • If Adobe Acrobat is installed, you can Save as Adobe PDF; otherwise, use Print to PDF with scaling options to fit columns to one page width
  • Homework formatting rules:
    • All answers must be typed; handwriting is not allowed to speed up feedback and consistency
    • Do not change the document’s formatting, indentation, or color; keep the instructor’s formatting conventions
    • Do not start a new document or alter the structure; consistent layout aids grading and feedback
    • Non-compliant submissions may lose credit due to grading inefficiency

Excel Formulas and Cell References: Getting Started

  • Excel formulas always start with an equals sign (=).
  • Example: computing the minimum RT critic value manually vs. using a function.
  • Basic approach using built-in functions (to explore):
    • MIN: finds the smallest value in a range
    • MAX: finds the largest value in a range
    • SUM: sum of values in a range
    • AVERAGE: arithmetic mean; note the instructor’s point that there is no dedicated MEAN function; in Excel, AVERAGE computes the mean
    • MEDIAN: middle value of the ordered data
    • MODE.SINGLE (or MODE in older versions): most frequent value
  • Naming and discovering functions:
    • Try typing a function name (e.g., MIN, MAX, SUM, MEDIAN, MODE) to see what Excel suggests
    • If you’re unsure, Google the function name to learn the exact syntax and arguments
  • Cell addressing basics:
    • A cell address has a column letter and a row number, e.g., A2, C4
    • Example: =A2 copies the value from cell A2 into the current cell; Excel updates automatically when A2 changes
  • Ranges (rectangular blocks of cells):
    • Represented as upper-left cell : lower-right cell, e.g., C2:E7
    • A range like C:C refers to the entire column C
    • For a formula: MIN(C:C) computes the minimum value across all of column C
  • Demonstrating ranges and formulas:
    • To find the minimum RT critic value across all rows: ext{MIN}(C:C)
    • To find the maximum RT critic value across all rows: ext{MAX}(C:C)
    • If you type a range like C2:C4 and drag the formula across columns, Excel will automatically adjust the column references (relative addressing)
  • Dragging and copying formulas across columns and rows:
    • When you drag a formula across columns, Excel updates the column references (C -> D -> E, etc.)
    • Dragging down rows updates row references accordingly
  • Central tendency in Excel:
    • Mean (arithmetic average): ar{x} = rac{1}{n} rac{ ext{sum of values}}{1}, computed in Excel as ext{AVERAGE}( ext{range})
    • Median: ext{Median}( ext{range})
    • Mode: ext{MODE.SINGLE}( ext{range}) (or MODE; mode multi may return multiple values)
  • Example worksheet tasks:
    • Compute the minimum RT critic value using MIN(C:C)
    • Compute the maximum RT critic value using MAX(C:C)
    • Compute the minimum RT fan value using MIN(D:D)
    • Compute the minimum IMDb value using MIN(E:E)
    • Drag formulas across to replicate for all rating columns (RT critic, RT fan, IMDb, etc.) so each column gets its own min/max
    • Compute the mean/median/mode for each rating column using AVERAGE, MEDIAN, and MODE.SINGLE respectively
  • Important note from the instructor:
    • There is no dedicated MEAN function in Excel; use AVERAGE for the mean and remember the mean is the arithmetic average
    • If a distribution has multiple modes, MODE.SINGLE may return a single mode; MODE.MULT can return multiple modes
  • Practical data-check tips:
    • If a dataset changes (new rows added), formulas that reference entire columns (C:C) automatically update
    • Always verify which ranges your formulas are referencing when dragging or extending data

Quick Takeaways and Real-World Relevance

  • The plan is to integrate theoretical understanding of distributions with hands-on Excel tools to analyze real data.
  • Being able to describe distribution shapes (modality and skew) is foundational for interpreting summary statistics and communicating findings clearly.
  • Understanding how to prepare data (formatting, headers, sorting) is essential for reliable analysis and reproducibility.
  • The connection to real-world data interpretation emphasizes critical thinking, not just mechanical calculation.

Quick Formulas Recap (LaTeX)

  • Minimum value in a data range:
    ext{Min} = \, ext{min}igl(x1, x2, ext{…}, x_nigr)
    or in Excel: ext{MIN}( ext{range})
  • Maximum value in a data range:
    ext{Max} = \, ext{max}igl(x1, x2, ext{…}, x_nigr)
    or in Excel: ext{MAX}( ext{range})
  • Sum of values in a data range:
    ext{Sum} = \sum{i=1}^{n} xi
    or in Excel: ext{SUM}( ext{range})
  • Arithmetic mean (the “mean”):
    ar{x} = rac{1}{n} \sum{i=1}^{n} xi
    in Excel: ext{AVERAGE}( ext{range})
  • Median:
    ext{Median} = ext{middle value of the ordered data}
    in Excel: ext{MEDIAN}( ext{range})
  • Mode:
    ext{Mode} = ext{the most frequent value}
    in Excel: ext{MODE.SINGLE}( ext{range}) (or MODE.MULT for multiple modes)
  • Range of cells (example):
    • Full column: C:C
    • Rectangular range: C2:E7
  • Mean/central tendency and the data range concept are foundational for choosing appropriate descriptive statistics depending on shape and measurement level.

Next Steps for You

  • Practice with the Movie Data workbook: delete extra columns, bold headers, adjust column widths, center alignment, sort by year, and save/export as PDF with fit-to-paper-width settings.
  • Practice creating and dragging MIN, MAX, SUM, AVERAGE, MEDIAN, and MODE.SINGLE formulas across multiple rating columns to reproduce results quickly.
  • Experiment with bin sizes and counts for histograms to build intuition about how shape changes with binning.
  • Reflect on distribution shapes in real data scenarios and articulate the modality and skew clearly using proper statistical terminology.
  • Keep communication with instructors and TAs open, and use office hours for clarifications early in the week to avoid end-of-week bottlenecks.