Statistics and Excel Review — Lecture Notes
Plan for Today
- Q&A and review based on feedback surveys from last week
- Review of what we learned last week in Excel
- Learn some new Excel skills
- Get you ready for the new semester
Meet Tyler: Your AI TA
- Tyler is introduced as the course’s AI assistant (TA) and has worked with the instructor for over a year.
- Tyler is a resource alongside the instructor for questions about homework or general course navigation.
- Office hours are by appointment and flexible; you can email with questions.
- The instructor and Tyler are available as resources; you can reach out to either one for help.
- The instructor emphasizes responsiveness to feedback surveys and personal replies to student feedback.
Office Hours, Communication, and Feedback
- Office hours are by appointment; email questions are encouraged.
- Feedback surveys: the instructor reads all comments and replies; students should not just report a numeric score but engage with the feedback.
- If you’re struggling, contact either the instructor or Tyler early rather than waiting.
- Clarification: questions about homework should be directed to office hours or email, not to the lecture feedback section.
Wednesdays: Structure and Strategy
- Wednesdays are busy: lecture to watch, participation quiz, lecture feedback survey, and a homework due.
- Advice: avoid procrastinating on homework; weekend or Monday time blocks are best for completion.
- If you don’t have questions about homework, you can still submit it on Wednesday; if you have questions, use Monday or Tuesday office hours.
- The goal is to avoid ending up with a heavy workload on Wednesday night.
- Workshops are recorded and available on Canvas; you can review them later.
- The instructor notes that some formatting tips (e.g., coloring, bolding) may be covered in workshops; students should not rely on Google for everything—strong guidance is provided.
Workshops and Recordings
- Workshops are recorded and accessible on Canvas.
- Most homework content is covered in some form in the workshops; there are hints and formatting options discussed.
- If you miss class, you can review the workshop recordings to fill in gaps.
Questions, Questions, Questions
- The instructor encourages asking questions as a sign of learning, bravery, and academic citizenship.
- Asking questions helps classmates too because it provides answers or clarifications that others may benefit from.
- Students should feel free to ask about any topic, not just the skew or bins.
Skew, Multimodality, and Distribution Shapes
- Skew: not symmetric; defined by the tail direction, not where most data lies.
- Positive skew (skewed to the right): tail extends to the positive direction.
- Negative skew (skewed to the left): tail extends to the negative direction.
- A quick classroom trick (a playful stat-impression): narrate yourself as the histogram and point to the tail; practice makes the concept intuitive.
- When asked to imagine a distribution’s shape, you should describe modality (unimodal, bimodal) and skew (positive, negative, or symmetric).
- A Kilroy’s example: describe a plausible distribution of drinks per night; you can propose unimodal with a peak around a central value and tails on both sides, but the exact shape may vary depending on reasoning.
- The instructor accepts reasonable, well-explained descriptions even if they don’t match a single “correct” shape.
- A Harvard GPA example: normal GPA range is 0–4; discussion of ceiling (near 4) and floor (near 0) effects; expected data pile near the top (ceiling) with tails downward and upward depending on student outcomes.
- Ceiling effect: data pile up at the upper limit; floor effect: data pile up at the lower limit.
- If both floor and ceiling effects occur, the distribution could be bimodal and potentially symmetric depending on the relative pile sizes.
- Practical reflections: even with fixed bounds, data can be skewed without a hard floor/ceiling effect if tails exist.
- Real-world relevance: understanding distribution shape helps interpret means, medians, modes, and informs critical evaluation of statistics in real-world claims.
- Answering shape questions well requires articulating a plausible data-generating process and using proper terminology.
- Questions about data shapes are tied to later topics such as mean, median, and mode, and how they respond to distribution shape.
- The value of imagining shapes: helps you be a critical consumer of statistics and recognize misleading presentations.
- Practical note: small samples can misrepresent the true shape and affect early course grades; be mindful of sample size when interpreting grades or feedback.
Bins, Histograms, and Data Binning Rules
- Bins are the intervals that histograms use to group data.
- Three practical rules for bins discussed:
- Rule 1: Start with about 10 bins to get a good sense of shape; 10 is a starting point, not a hard rule.
- Rule 2: Choose a bin size that is a nice round number and makes the range coverable in roughly 10 bins (e.g., range 0–5 with bin size 0.5 yields 10 bins).
- Rule 3: Start on multiples of the chosen bin size (e.g., 0, 0.5, 1.0, …). Avoid starting on a non-round lower bound like 0.03.
- It’s acceptable for some bins to be empty.
- There are cases where a non-10-bin setup is better for answering specific questions; the general guideline is to use around 10 bins for shape exploration.
- Alternative bin counts (8, 12, etc.) can be used depending on the question at hand.
- In practice, you should select a bin size that yields a clean, round-numbered bin range and ensure the lower bound starts at a round number.
- Example discussion: Harvard GPAs with a range 0–4 would suggest 10 bins with a width of 0.4 if strictly following 0–4, but the instructor recommends a round-number bin width (e.g., 0.5) and starting at a round lower bound to get ~10 bins.
- Notes on interpretation: some bins may be very sparse or empty; the key is to capture shape rather than perfection.
- Special cases: some questions may benefit from non-standard bin counts or bin sizes depending on the distribution and the precise question asked.
- Quick recap: 10-bin starting point, round bin width, lower bound on a round number; adjust as needed for clarity and the question.
Floor and Ceiling Effects and Skew
- Floor effect: many observations pile up at the minimum possible value (lower bound).
- Ceiling effect: many observations pile up at the maximum possible value (upper bound).
- Both floor and ceiling effects can produce skew; however, skew can also occur without explicit floor/ceiling effects if there is a long tail in one direction.
- You can have distributions with fixed bounds that are still skewed if most data lie away from the bounds and tails extend in one direction.
- Bowling scores example: bowling has a hard upper limit (300) but it’s not common to hit that, so skew isn’t necessarily due to a ceiling in practice.
- If both a floor and ceiling are heavily populated, the distribution could become bimodal; depending on which pile is larger and the tail directions, the skew could be negative or positive.
- Practical exam tip: when asked to describe shape, you should mention modality (unimodal/bimodal) and skew (symmetric, left-skewed, right-skewed).
Scale vs Ordinal Data and Ranking
- Ordinal data: conveys order but not equal spacing between ranks (e.g., rank 1, rank 2, rank 3 without implying equal distances).
- Scale data (interval/ratio): distances between values are meaningful and uniform (e.g., lengths, GPA scale where differences are consistent).
- Example discussion from Tyler: a grade rank column is ordinal if only the order is known; if the distances between ranks are meaningful and uniform, it could be scaled.
- Distinguishing features:
- Ordinal: order matters; distances unknown or not equal.
- Scale: distance between values is meaningful and constant.
- Practical takeaway: when you need to compare central tendency or variability, know whether your data are ordinal or scaled, because that changes which statistical summaries are appropriate.
- The instructor emphasizes asking questions about levels of measurement as a core skill for data interpretation.
Small Sample Sizes, Grades, and Data Interpretation
- Acknowledge that small samples can be misleading for interpreting distributions or making inferences.
- Canvas grade visibility: early in the semester, grade displays may be unstable due to small sample sizes; expect the grade to be “hidden” or not fully representative until more data are collected.
- The instructor plans to unlock grade visibility later in the term once data become more representative.
- Advice: maximize performance across quizzes, homework, and participation to avoid misinterpretation due to small-sample noise.
- Excel and grading practice: the course teaches how to download and compute grades in Excel as a learning exercise.
Excel Review: Getting Started with the Movie Data Workbook
- Data file: a workbook with multiple movie-related columns used for five weeks of practice.
- Initial columns include:
- Movie title
- MPAA rating (e.g., G, PG, PG-13, R)
- Ratings from Rotten Tomatoes critics (RT critic), Rotten Tomatoes fans (RT fan), IMDB, and Metacritic
- A synthetic column called Rick rate for demonstration purposes
- Box office revenue
- A set of “flags” (binary columns) indicating movie type (action, animation, family, drama, etc.)
- For today’s tasks: delete all the flag columns and focus on the core data (title, rating, critic/fan/IMDb/Metacritic scores, box office).
- Formatting steps demonstrated:
- Bold headers for columns (Row 1) using Ctrl+B or the Bold option
- Auto-adjust column width to fit contents
- Center-align rating columns for readability
- Sort by year released (oldest to newest) with data-wide selection to keep rows intact
- Use Custom Sort to ensure the entire row is sorted by the chosen column (e.g., Year)
- If Excel misidentifies headers, check the "My data has headers" option to restore header rows and proper sorting
- Saving and exporting:
- Save progress frequently (File > Save)
- Save as PDF with consideration for readability: prefer "Fit to paper width" so that all columns fit on one or a few pages, avoiding overly small text
- If Adobe Acrobat is installed, you can Save as Adobe PDF; otherwise, use Print to PDF with scaling options to fit columns to one page width
- Homework formatting rules:
- All answers must be typed; handwriting is not allowed to speed up feedback and consistency
- Do not change the document’s formatting, indentation, or color; keep the instructor’s formatting conventions
- Do not start a new document or alter the structure; consistent layout aids grading and feedback
- Non-compliant submissions may lose credit due to grading inefficiency
- Excel formulas always start with an equals sign (=).
- Example: computing the minimum RT critic value manually vs. using a function.
- Basic approach using built-in functions (to explore):
- MIN: finds the smallest value in a range
- MAX: finds the largest value in a range
- SUM: sum of values in a range
- AVERAGE: arithmetic mean; note the instructor’s point that there is no dedicated MEAN function; in Excel, AVERAGE computes the mean
- MEDIAN: middle value of the ordered data
- MODE.SINGLE (or MODE in older versions): most frequent value
- Naming and discovering functions:
- Try typing a function name (e.g., MIN, MAX, SUM, MEDIAN, MODE) to see what Excel suggests
- If you’re unsure, Google the function name to learn the exact syntax and arguments
- Cell addressing basics:
- A cell address has a column letter and a row number, e.g., A2, C4
- Example: =A2 copies the value from cell A2 into the current cell; Excel updates automatically when A2 changes
- Ranges (rectangular blocks of cells):
- Represented as upper-left cell : lower-right cell, e.g., C2:E7
- A range like C:C refers to the entire column C
- For a formula: MIN(C:C) computes the minimum value across all of column C
- Demonstrating ranges and formulas:
- To find the minimum RT critic value across all rows: ext{MIN}(C:C)
- To find the maximum RT critic value across all rows: ext{MAX}(C:C)
- If you type a range like C2:C4 and drag the formula across columns, Excel will automatically adjust the column references (relative addressing)
- Dragging and copying formulas across columns and rows:
- When you drag a formula across columns, Excel updates the column references (C -> D -> E, etc.)
- Dragging down rows updates row references accordingly
- Central tendency in Excel:
- Mean (arithmetic average): ar{x} = rac{1}{n} rac{ ext{sum of values}}{1}, computed in Excel as ext{AVERAGE}( ext{range})
- Median: ext{Median}( ext{range})
- Mode: ext{MODE.SINGLE}( ext{range}) (or MODE; mode multi may return multiple values)
- Example worksheet tasks:
- Compute the minimum RT critic value using MIN(C:C)
- Compute the maximum RT critic value using MAX(C:C)
- Compute the minimum RT fan value using MIN(D:D)
- Compute the minimum IMDb value using MIN(E:E)
- Drag formulas across to replicate for all rating columns (RT critic, RT fan, IMDb, etc.) so each column gets its own min/max
- Compute the mean/median/mode for each rating column using AVERAGE, MEDIAN, and MODE.SINGLE respectively
- Important note from the instructor:
- There is no dedicated MEAN function in Excel; use AVERAGE for the mean and remember the mean is the arithmetic average
- If a distribution has multiple modes, MODE.SINGLE may return a single mode; MODE.MULT can return multiple modes
- Practical data-check tips:
- If a dataset changes (new rows added), formulas that reference entire columns (C:C) automatically update
- Always verify which ranges your formulas are referencing when dragging or extending data
Quick Takeaways and Real-World Relevance
- The plan is to integrate theoretical understanding of distributions with hands-on Excel tools to analyze real data.
- Being able to describe distribution shapes (modality and skew) is foundational for interpreting summary statistics and communicating findings clearly.
- Understanding how to prepare data (formatting, headers, sorting) is essential for reliable analysis and reproducibility.
- The connection to real-world data interpretation emphasizes critical thinking, not just mechanical calculation.
- Minimum value in a data range:
ext{Min} = \, ext{min}igl(x1, x2, ext{…}, x_nigr)
or in Excel: ext{MIN}( ext{range}) - Maximum value in a data range:
ext{Max} = \, ext{max}igl(x1, x2, ext{…}, x_nigr)
or in Excel: ext{MAX}( ext{range}) - Sum of values in a data range:
ext{Sum} = \sum{i=1}^{n} xi
or in Excel: ext{SUM}( ext{range}) - Arithmetic mean (the “mean”):
ar{x} = rac{1}{n} \sum{i=1}^{n} xi
in Excel: ext{AVERAGE}( ext{range}) - Median:
ext{Median} = ext{middle value of the ordered data}
in Excel: ext{MEDIAN}( ext{range}) - Mode:
ext{Mode} = ext{the most frequent value}
in Excel: ext{MODE.SINGLE}( ext{range}) (or MODE.MULT for multiple modes) - Range of cells (example):
- Full column: C:C
- Rectangular range: C2:E7
- Mean/central tendency and the data range concept are foundational for choosing appropriate descriptive statistics depending on shape and measurement level.
Next Steps for You
- Practice with the Movie Data workbook: delete extra columns, bold headers, adjust column widths, center alignment, sort by year, and save/export as PDF with fit-to-paper-width settings.
- Practice creating and dragging MIN, MAX, SUM, AVERAGE, MEDIAN, and MODE.SINGLE formulas across multiple rating columns to reproduce results quickly.
- Experiment with bin sizes and counts for histograms to build intuition about how shape changes with binning.
- Reflect on distribution shapes in real data scenarios and articulate the modality and skew clearly using proper statistical terminology.
- Keep communication with instructors and TAs open, and use office hours for clarifications early in the week to avoid end-of-week bottlenecks.