Intro to Data and Statistics — Core Concepts and Examples

Data and statistics foundations
- Data are information collected from observations, surveys, or experiments; they are the most basic units of information in statistics and the building blocks of statistical studies.
- Statistics is the field concerned with collecting, describing, and analyzing data.
Course structure overview (from the textbook):
- Chapter 1: proper ways to collect data.
- Chapter 2: describe data by summarizing and visualizing it.
- Remaining chapters: how to appropriately analyze data.
Why study statistics?
- Data are everywhere; the volume of new technical information doubles roughly every two years.
- Estimated amount of unique new information generated this year: $7.2\times 10^{21}$ bytes (7.2 zettabytes).
- Sources of data include:
  - Social media participation
  - Grocery-store swipe cards
  - Political polls
  - Amazon purchase history
  - Health data (blood pressure at doctor visits)
  - Sports statistics
  - Google search history
  - Etc.
- Your daily life creates an extensive data trail regardless of your field or interests; decisions are often based on data or evaluated based on data from others.
- Contemporary context: a February 2009 New York Times article highlighted statistics as essential for today’s graduates.
Vocabulary and notation: learning statistics entails new terms and symbols; flashcards are recommended for mastery.
- A practical study approach: study vocabulary in both directions (definition-to-term and term-to-definition) and mix up the deck periodically.
Core concepts: cases, variables, and datasets
- Case: the subject or object that information is obtained about in a dataset.
- Variable: a characteristic that corresponds to each case in a dataset.
- Dataset structure: rows represent cases; columns represent variables.
Illustrative example: do comedies or dramas have higher audience ratings?
- Cases: the movies themselves (we collect data about movies).
- Variables: two variables in this simple example:
  - Genre: comedy or drama
  - Audience rating: numerical rating for each movie
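The rows-as-cases, columns-as-variables layout can be sketched in a few lines of Python. This is a minimal illustration only: the movie titles and ratings below are made up, and the names `movies` and `mean_rating` are hypothetical, not taken from the textbook.

```python
# Hypothetical data: titles and ratings are invented for illustration.
# Each dict is one case (a row); "genre" and "audience_rating" are the variables (columns).
movies = [
    {"title": "Movie A", "genre": "comedy", "audience_rating": 71},
    {"title": "Movie B", "genre": "drama", "audience_rating": 84},
    {"title": "Movie C", "genre": "comedy", "audience_rating": 65},
    {"title": "Movie D", "genre": "drama", "audience_rating": 78},
]

# The cases are the movies; the title only identifies each case.
cases = [m["title"] for m in movies]

# The variables are the characteristics recorded for every case.
variables = [key for key in movies[0] if key != "title"]


def mean_rating(genre):
    """Average audience rating over the cases that fall in one genre."""
    ratings = [m["audience_rating"] for m in movies if m["genre"] == genre]
    return sum(ratings) / len(ratings)


print("cases:", cases)
print("variables:", variables)
print("comedy mean:", mean_rating("comedy"))
print("drama mean:", mean_rating("drama"))
```

With real data, the same structure would hold one row per movie in the study; only the values would change.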
More complex dataset example: Countries of the world
- Leftmost column lists cases: the countries.
- Other columns describe characteristics of each case: land area, population, percent rural, health, Internet, birth rate, life expectancy, HIV rates.
- Assessment questions to reinforce the definitions:
  - Q1: In the countries of the world dataset, what are the cases?
    - Answer: Countries (the subjects we obtain information about).
  - Q2: In the same dataset, which options are the variables? (Choose all that apply.)
    - Options: a) countries, b) land areas, c) populations, d) percent rural, e) health, f) all of the above, g) just answers b through e
    - Explanation:
      - Countries are the cases, not variables.
      - Land areas, populations, percent rural, and health are characteristics measured for each country, i.e., variables.
      - Therefore, the correct choice is g (b through e).
Data presentation and variability
- In the countries dataset, the data are shown in a table, with the cases in the leftmost column and the variable names in the top row; the body of the table contains the data values.
- Not all data are numeric; the kidney cancer map example shows numeric rates, but other datasets may include non-numeric values.
Additional datasets and interpretation
- Elf dataset example (from the textbook): all data are numeric in that particular example, but that is not guaranteed for all datasets.
- Intro to Statistics survey dataset: used as another example to illustrate cases vs. variables.
  - Question: what are the cases in this survey dataset? Answer: the intro to statistics students. Each row corresponds to one student, and the title indicates that the data were recorded for each student, so each student is a case.
  - Question: which of the options are the variables? Answer: all of the listed columns (year, gender, SAT, GPA, number of siblings, etc.).
Kidney cancer death-rate maps: interpreting cases and values
- Visuals show counties with the highest and lowest kidney cancer death rates.
- A common pattern observed: central and mountain zones have more counties shaded (higher rates), while eastern and pacific zones have fewer.
- One possible explanation discussed: small sample sizes can lead to more extreme results; counties with smaller populations may show more extreme rates.
- Clarification on what constitutes a case in this dataset:
  - If the values are the rates (e.g., death rates) for each county, then the cases are the counties themselves.
  - If the values are instead yes/no outcomes for individuals (e.g., whether a person died of kidney cancer), the appropriate cases are the people living in the United States rather than the counties.
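The small-sample explanation for the map pattern can be checked with a short simulation. This is a sketch under stated assumptions: every simulated county shares the same underlying rate, and that rate (`TRUE_RATE = 0.05`) is a deliberately large, hypothetical value chosen so the simulation runs quickly; it is not a real kidney cancer death rate. The function names are likewise illustrative.

```python
import random


def simulate_rates(population, n_counties, true_rate, rng):
    """Return the observed event rate in each simulated county."""
    rates = []
    for _ in range(n_counties):
        # Each resident independently has the event with probability true_rate.
        events = sum(rng.random() < true_rate for _ in range(population))
        rates.append(events / population)
    return rates


def spread(rates):
    """Distance between the highest and lowest observed rates."""
    return max(rates) - min(rates)


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_RATE = 0.05        # hypothetical rate, identical for every county

small_counties = simulate_rates(100, 100, TRUE_RATE, rng)   # 100 residents each
large_counties = simulate_rates(5000, 100, TRUE_RATE, rng)  # 5000 residents each

print("spread among small counties:", spread(small_counties))
print("spread among large counties:", spread(large_counties))
```

Even though every county has the same underlying rate, the small counties produce both the highest and the lowest observed rates, which is the pattern the explanation predicts.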
Categorical vs quantitative variables
- Categorical variable: divides the cases into two or more groups, placing each case in exactly one category.
- Quantitative variable: measures or records a numerical quantity for each case.
- Application to datasets:
  - In the intro stats dataset, the following classifications were discussed:
    - Year: categorical (e.g., freshman, sophomore, junior, senior); the values are non-numeric and group the cases.
    - Gender: categorical.
    - Higher SAT: categorical in the example narrative, because it sorts students into groups rather than recording a score. The SAT score itself is quantitative.
    - GPA, height, weight, etc.: quantitative.
  - The kidney cancer example depends on the definition of cases:
    - If cases are counties, the measured variable (rate) is quantitative.
    - If cases are individuals and the data are yes/no, the measured variable is categorical.
- Key takeaway: whether a variable is categorical or quantitative can depend on what you define as the cases you are studying.
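The two framings of the kidney cancer data can be made concrete with a small sketch. The records below are hypothetical, and `variable_type` is only a rough heuristic written for this illustration: it treats numeric values as quantitative, which would misclassify numeric codes such as zip codes.

```python
from collections import defaultdict

# Framing 1: the cases are people; the variable is a yes/no outcome.
# Hypothetical records, invented for illustration.
people = [
    {"county": "X", "outcome": "yes"},
    {"county": "X", "outcome": "no"},
    {"county": "X", "outcome": "no"},
    {"county": "X", "outcome": "no"},
    {"county": "Y", "outcome": "no"},
    {"county": "Y", "outcome": "no"},
]


def variable_type(values):
    """Rough heuristic: all-numeric values count as quantitative, anything else as categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "categorical"


# Framing 2: the cases are counties; the variable is the rate within each county.
counts = defaultdict(lambda: [0, 0])  # county -> [number of "yes", number of people]
for person in people:
    counts[person["county"]][1] += 1
    if person["outcome"] == "yes":
        counts[person["county"]][0] += 1

counties = [
    {"county": name, "rate": yes / total} for name, (yes, total) in counts.items()
]

print(variable_type([p["outcome"] for p in people]))  # person-level variable
print(variable_type([c["rate"] for c in counties]))   # county-level variable
print(counties)
```

The underlying information is the same in both framings; changing the cases from people to counties is what turns a categorical variable into a quantitative one.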
Explanatory vs. response variables
- Explanatory variable: a variable used to help understand or predict values of another variable.
- Response variable: the outcome you want to understand or predict.
- Examples from the transcript:
  - Meditation and stress: explanatory = meditation; response = stress level (we want to see if meditation affects stress).
  - Sugar consumption and hyperactivity: explanatory = sugar consumption; response = hyperactivity.
  - Yogurt consumption and weight loss: explanatory = eating yogurt; response = weight loss.
  - Wearing red and attraction: explanatory = wearing red; response = attraction.
- Note: the explanatory variable is not necessarily the one mentioned first in a research question; the roles depend on what you are trying to understand or predict.
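One way to see the two roles is to summarize the response variable separately for each value of the explanatory variable. The sketch below uses the meditation example with made-up values; the field names `meditates` and `stress_level` and the helper `mean_response_by_group` are hypothetical, and the output illustrates the roles only, not any real finding.

```python
from collections import defaultdict

# Hypothetical participants; the stress scores are invented for illustration.
participants = [
    {"meditates": "yes", "stress_level": 3},
    {"meditates": "yes", "stress_level": 4},
    {"meditates": "no", "stress_level": 6},
    {"meditates": "no", "stress_level": 7},
]


def mean_response_by_group(records, explanatory, response):
    """Average the response variable within each value of the explanatory variable."""
    groups = defaultdict(list)
    for record in records:
        groups[record[explanatory]].append(record[response])
    return {group: sum(values) / len(values) for group, values in groups.items()}


summary = mean_response_by_group(
    participants, explanatory="meditates", response="stress_level"
)
print(summary)
```

Swapping the two arguments would answer a different question, which is why the roles have to be chosen from what you are trying to understand or predict rather than from the order in which the variables are mentioned.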
Summary and practical implications
- Data are ubiquitous and can describe almost anything you study or observe.
- A dataset consists of variables measured on cases; variables can be categorical or quantitative.
- Proper interpretation depends on correctly identifying the cases and the variables, and recognizing that the same data can be framed differently depending on the research question.
- Practical implications include the need to be mindful of sample size (small samples can yield extreme results), data source quality, and the appropriate assignment of cases and variables when modeling relationships.
Quick reference formulas and numerical notes
- Data growth context: data volume doubles roughly every two years.
- Estimated new information generated in the year cited: $7.2\times 10^{21}$ bytes (7.2 zettabytes).
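A doubling time of two years corresponds to the growth rule $V(t) = V_0 \cdot 2^{t/2}$, with $t$ in years. The sketch below applies that rule to the 7.2 zettabyte figure. Combining the two numbers this way is an assumption made for illustration, since the notes state them separately; `projected_volume` is a hypothetical helper name.

```python
def projected_volume(current, years, doubling_period=2.0):
    """Project a quantity forward assuming it doubles every `doubling_period` years."""
    return current * 2 ** (years / doubling_period)


current_bytes = 7.2e21  # 7.2 zettabytes, the figure quoted in the notes

for years in (0, 2, 4, 10):
    print(years, "years:", projected_volume(current_bytes, years), "bytes")
```

Under this assumption the volume doubles after 2 years, quadruples after 4, and grows by a factor of $2^5 = 32$ after 10.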
Study strategies emphasized in the transcript
- Use flashcards to learn vocabulary and notation.
- Review terms in both directions (term → definition and definition → term).
- Regularly mix up the deck to reinforce retrieval.
Real-world relevance
- Data-informed decision making spans virtually all disciplines and careers; understanding how to identify cases, variables, and the proper type of data is foundational for analyzing and interpreting information accurately.