Statistics and Excel Review — Lecture Notes

Plan for Today

Q&A and review based on feedback surveys from last week
Review of what we learned last week in Excel
Learn some new Excel skills
Get you ready for the new semester

Meet Tyler: Your AI TA

Tyler is introduced as the course’s AI assistant (TA) and has worked with the instructor for over a year.
Tyler is a resource alongside the instructor for questions about homework or general course navigation.
Office hours are by appointment and flexible; you can email with questions.
The instructor and Tyler are available as resources; you can reach out to either one for help.
The instructor emphasizes responsiveness to feedback surveys and personal replies to student feedback.

Office Hours, Communication, and Feedback

Office hours are by appointment; email questions are encouraged.
Feedback surveys: the instructor reads all comments and replies; students should not just report a numeric score but engage with the feedback.
If you’re struggling, contact either the instructor or Tyler early rather than waiting.
Clarification: questions about homework should be directed to office hours or email, not to the lecture feedback section.

Wednesdays: Structure and Strategy

Wednesdays are busy: lecture to watch, participation quiz, lecture feedback survey, and a homework due.
Advice: avoid procrastinating on homework; weekend or Monday time blocks are best for completion.
If you don’t have questions about homework, you can still submit it on Wednesday; if you have questions, use Monday or Tuesday office hours.
The goal is to avoid ending up with a heavy workload on Wednesday night.
Workshops are recorded and available on Canvas; you can review them later.
The instructor notes that some formatting tips (e.g., coloring, bolding) may be covered in workshops; students should not rely on Google for everything—strong guidance is provided.

Workshops and Recordings

Workshops are recorded and accessible on Canvas.
Most homework content is covered in some form in the workshops; there are hints and formatting options discussed.
If you miss class, you can review the workshop recordings to fill in gaps.

Questions, Questions, Questions

The instructor encourages asking questions as a sign of learning, bravery, and academic citizenship.
Asking questions helps classmates too because it provides answers or clarifications that others may benefit from.
Students should feel free to ask about any topic, not just the skew or bins.

Skew, Multimodality, and Distribution Shapes

Skew: not symmetric; defined by the tail direction, not where most data lies.
Positive skew (skewed to the right): tail extends to the positive direction.
Negative skew (skewed to the left): tail extends to the negative direction.
A quick classroom trick (a playful stat-impression): narrate yourself as the histogram and point to the tail; practice makes the concept intuitive.
When asked to imagine a distribution’s shape, you should describe modality (unimodal, bimodal) and skew (positive, negative, or symmetric).
A Kilroy’s example: describe a plausible distribution of drinks per night; you can propose unimodal with a peak around a central value and tails on both sides, but the exact shape may vary depending on reasoning.
The instructor accepts reasonable, well-explained descriptions even if they don’t match a single “correct” shape.
A Harvard GPA example: normal GPA range is 0–4; discussion of ceiling (near 4) and floor (near 0) effects; expected data pile near the top (ceiling) with tails downward and upward depending on student outcomes.
Ceiling effect: data pile up at the upper limit; floor effect: data pile up at the lower limit.
If both floor and ceiling effects occur, the distribution could be bimodal and potentially symmetric depending on the relative pile sizes.
Practical reflections: even with fixed bounds, data can be skewed without a hard floor/ceiling effect if tails exist.
Real-world relevance: understanding distribution shape helps interpret means, medians, modes, and informs critical evaluation of statistics in real-world claims.
Answering shape questions well requires articulating a plausible data-generating process and using proper terminology.
Questions about data shapes are tied to later topics such as mean, median, and mode, and how they respond to distribution shape.
The value of imagining shapes: helps you be a critical consumer of statistics and recognize misleading presentations.
Practical note: small samples can misrepresent the true shape and affect early course grades; be mindful of sample size when interpreting grades or feedback.

Bins, Histograms, and Data Binning Rules

Bins are the intervals that histograms use to group data.
Three practical rules for bins discussed:
- Rule 1: Start with about 10 bins to get a good sense of shape; 10 is a starting point, not a hard rule.
- Rule 2: Choose a bin size that is a nice round number and makes the range coverable in roughly 10 bins (e.g., range 0–5 with bin size 0.5 yields 10 bins).
- Rule 3: Start on multiples of the chosen bin size (e.g., 0, 0.5, 1.0, …). Avoid starting on a non-round lower bound like 0.03.
It’s acceptable for some bins to be empty.
There are cases where a non-10-bin setup is better for answering specific questions; the general guideline is to use around 10 bins for shape exploration.
Alternative bin counts (8, 12, etc.) can be used depending on the question at hand.
In practice, you should select a bin size that yields a clean, round-numbered bin range and ensure the lower bound starts at a round number.
Example discussion: Harvard GPAs with a range 0–4 would suggest 10 bins with a width of 0.4 if strictly following 0–4, but the instructor recommends a round-number bin width (e.g., 0.5) and starting at a round lower bound to get ~10 bins.
Notes on interpretation: some bins may be very sparse or empty; the key is to capture shape rather than perfection.
Special cases: some questions may benefit from non-standard bin counts or bin sizes depending on the distribution and the precise question asked.
Quick recap: 10-bin starting point, round bin width, lower bound on a round number; adjust as needed for clarity and the question.

Floor and Ceiling Effects and Skew

Floor effect: many observations pile up at the minimum possible value (lower bound).
Ceiling effect: many observations pile up at the maximum possible value (upper bound).
Both floor and ceiling effects can produce skew; however, skew can also occur without explicit floor/ceiling effects if there is a long tail in one direction.
You can have distributions with fixed bounds that are still skewed if most data lie away from the bounds and tails extend in one direction.
Bowling scores example: bowling has a hard upper limit (300) but it’s not common to hit that, so skew isn’t necessarily due to a ceiling in practice.
If both a floor and ceiling are heavily populated, the distribution could become bimodal; depending on which pile is larger and the tail directions, the skew could be negative or positive.
Practical exam tip: when asked to describe shape, you should mention modality (unimodal/bimodal) and skew (symmetric, left-skewed, right-skewed).

Scale vs Ordinal Data and Ranking

Ordinal data: conveys order but not equal spacing between ranks (e.g., rank 1, rank 2, rank 3 without implying equal distances).
Scale data (interval/ratio): distances between values are meaningful and uniform (e.g., lengths, GPA scale where differences are consistent).
Example discussion from Tyler: a grade rank column is ordinal if only the order is known; if the distances between ranks are meaningful and uniform, it could be scaled.
Distinguishing features:
- Ordinal: order matters; distances unknown or not equal.
- Scale: distance between values is meaningful and constant.
Practical takeaway: when you need to compare central tendency or variability, know whether your data are ordinal or scaled, because that changes which statistical summaries are appropriate.
The instructor emphasizes asking questions about levels of measurement as a core skill for data interpretation.

Small Sample Sizes, Grades, and Data Interpretation

Acknowledge that small samples can be misleading for interpreting distributions or making inferences.
Canvas grade visibility: early in the semester, grade displays may be unstable due to small sample sizes; expect the grade to be “hidden” or not fully representative until more data are collected.
The instructor plans to unlock grade visibility later in the term once data become more representative.
Advice: maximize performance across quizzes, homework, and participation to avoid misinterpretation due to small-sample noise.
Excel and grading practice: the course teaches how to download and compute grades in Excel as a learning exercise.

Excel Review: Getting Started with the Movie Data Workbook

Data file: a workbook with multiple movie-related columns used for five weeks of practice.
Initial columns include:
- Movie title
- MPAA rating (e.g., G, PG, PG-13, R)
- Ratings from Rotten Tomatoes critics (RT critic), Rotten Tomatoes fans (RT fan), IMDB, and Metacritic
- A synthetic column called Rick rate for demonstration purposes
- Box office revenue
- A set of “flags” (binary columns) indicating movie type (action, animation, family, drama, etc.)
For today’s tasks: delete all the flag columns and focus on the core data (title, rating, critic/fan/IMDb/Metacritic scores, box office).
Formatting steps demonstrated:
- Bold headers for columns (Row 1) using Ctrl+B or the Bold option
- Auto-adjust column width to fit contents
- Center-align rating columns for readability
- Sort by year released (oldest to newest) with data-wide selection to keep rows intact
- Use Custom Sort to ensure the entire row is sorted by the chosen column (e.g., Year)
- If Excel misidentifies headers, check the "My data has headers" option to restore header rows and proper sorting
Saving and exporting:
- Save progress frequently (File > Save)
- Save as PDF with consideration for readability: prefer "Fit to paper width" so that all columns fit on one or a few pages, avoiding overly small text
- If Adobe Acrobat is installed, you can Save as Adobe PDF; otherwise, use Print to PDF with scaling options to fit columns to one page width
Homework formatting rules:
- All answers must be typed; handwriting is not allowed to speed up feedback and consistency
- Do not change the document’s formatting, indentation, or color; keep the instructor’s formatting conventions
- Do not start a new document or alter the structure; consistent layout aids grading and feedback
- Non-compliant submissions may lose credit due to grading inefficiency

Excel Formulas and Cell References: Getting Started

Excel formulas always start with an equals sign (=).
Example: computing the minimum RT critic value manually vs. using a function.
Basic approach using built-in functions (to explore):
- MIN: finds the smallest value in a range
- MAX: finds the largest value in a range
- SUM: sum of values in a range
- AVERAGE: arithmetic mean; note the instructor’s point that there is no dedicated MEAN function; in Excel, AVERAGE computes the mean
- MEDIAN: middle value of the ordered data
- MODE.SINGLE (or MODE in older versions): most frequent value
Naming and discovering functions:
- Try typing a function name (e.g., MIN, MAX, SUM, MEDIAN, MODE) to see what Excel suggests
- If you’re unsure, Google the function name to learn the exact syntax and arguments
Cell addressing basics:
- A cell address has a column letter and a row number, e.g., A2, C4
- Example: =A2 copies the value from cell A2 into the current cell; Excel updates automatically when A2 changes
Ranges (rectangular blocks of cells):
- Represented as upper-left cell : lower-right cell, e.g., C2:E7
- A range like C:C refers to the entire column C
- For a formula: MIN(C:C) computes the minimum value across all of column C
Demonstrating ranges and formulas:
- To find the minimum RT critic value across all rows: ext{MIN}(C:C)
- To find the maximum RT critic value across all rows: ext{MAX}(C:C)
- If you type a range like C2:C4 and drag the formula across columns, Excel will automatically adjust the column references (relative addressing)
Dragging and copying formulas across columns and rows:
- When you drag a formula across columns, Excel updates the column references (C -> D -> E, etc.)
- Dragging down rows updates row references accordingly
Central tendency in Excel:
- Mean (arithmetic average): ar{x} = rac{1}{n} rac{ ext{sum of values}}{1}, computed in Excel as ext{AVERAGE}( ext{range})
- Median: ext{Median}( ext{range})
- Mode: ext{MODE.SINGLE}( ext{range}) (or MODE; mode multi may return multiple values)
Example worksheet tasks:
- Compute the minimum RT critic value using MIN(C:C)
- Compute the maximum RT critic value using MAX(C:C)
- Compute the minimum RT fan value using MIN(D:D)
- Compute the minimum IMDb value using MIN(E:E)
- Drag formulas across to replicate for all rating columns (RT critic, RT fan, IMDb, etc.) so each column gets its own min/max
- Compute the mean/median/mode for each rating column using AVERAGE, MEDIAN, and MODE.SINGLE respectively
Important note from the instructor:
- There is no dedicated MEAN function in Excel; use AVERAGE for the mean and remember the mean is the arithmetic average
- If a distribution has multiple modes, MODE.SINGLE may return a single mode; MODE.MULT can return multiple modes
Practical data-check tips:
- If a dataset changes (new rows added), formulas that reference entire columns (C:C) automatically update
- Always verify which ranges your formulas are referencing when dragging or extending data

Quick Takeaways and Real-World Relevance

The plan is to integrate theoretical understanding of distributions with hands-on Excel tools to analyze real data.
Being able to describe distribution shapes (modality and skew) is foundational for interpreting summary statistics and communicating findings clearly.
Understanding how to prepare data (formatting, headers, sorting) is essential for reliable analysis and reproducibility.
The connection to real-world data interpretation emphasizes critical thinking, not just mechanical calculation.

Quick Formulas Recap (LaTeX)

Minimum value in a data range:
ext{Min} = \, ext{min}igl(x1, x2, ext{…}, x_nigr)
or in Excel: ext{MIN}( ext{range})
Maximum value in a data range:
ext{Max} = \, ext{max}igl(x1, x2, ext{…}, x_nigr)
or in Excel: ext{MAX}( ext{range})
Sum of values in a data range:
ext{Sum} = \sum{i=1}^{n} xi
or in Excel: ext{SUM}( ext{range})
Arithmetic mean (the “mean”):
ar{x} = rac{1}{n} \sum{i=1}^{n} xi
in Excel: ext{AVERAGE}( ext{range})
Median:
ext{Median} = ext{middle value of the ordered data}
in Excel: ext{MEDIAN}( ext{range})
Mode:
ext{Mode} = ext{the most frequent value}
in Excel: ext{MODE.SINGLE}( ext{range}) (or MODE.MULT for multiple modes)
Range of cells (example):
- Full column: C:C
- Rectangular range: C2:E7
Mean/central tendency and the data range concept are foundational for choosing appropriate descriptive statistics depending on shape and measurement level.

Next Steps for You

Practice with the Movie Data workbook: delete extra columns, bold headers, adjust column widths, center alignment, sort by year, and save/export as PDF with fit-to-paper-width settings.
Practice creating and dragging MIN, MAX, SUM, AVERAGE, MEDIAN, and MODE.SINGLE formulas across multiple rating columns to reproduce results quickly.
Experiment with bin sizes and counts for histograms to build intuition about how shape changes with binning.
Reflect on distribution shapes in real data scenarios and articulate the modality and skew clearly using proper statistical terminology.
Keep communication with instructors and TAs open, and use office hours for clarifications early in the week to avoid end-of-week bottlenecks.