R Summary Statistics with summary() in iris and HOUSEDATA
Overview of summary() in R
The video explains how to compute summary statistics for the variables in a dataset using the summary() command in R. The command is written in lowercase, and the dataset name goes inside the parentheses. Capitalization matters because R is case-sensitive. For a built-in dataset like iris, you can simply run summary(iris) and obtain summary statistics for each variable. For numeric variables, the output includes the minimum, first quartile, median, mean, third quartile, and maximum. Specifically, for a numeric vector x, the summary provides:
- minimum: the smallest value, represented as $\text{min}(x)$
- first quartile: $Q_1(x)$ (the 25th percentile)
- median: $\text{median}(x)$ (the 50th percentile)
- mean: $\bar{x}$
- third quartile: $Q_3(x)$ (the 75th percentile)
- maximum: the largest value, represented as $\text{max}(x)$
In addition, if a variable has missing values, the summary reports how many are missing for that variable. For categorical variables, the output depends on how the data are stored in R: a factor (such as the Species variable in iris) is summarized as counts for each category, while a column stored as character text shows only its length, class, and mode.
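As a minimal sketch of the behavior described above, the following runs summary() on iris, on one numeric column, on a copy with two values deliberately blanked out (iris_na is a made-up name used only for this illustration), and on the Species factor.

```r
# iris ships with R, so no import is needed
print(summary(iris))

# One numeric column: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
print(summary(iris$Sepal.Length))

# Illustrative copy with two values set to NA, to show the "NA's" count
iris_na <- iris
iris_na$Sepal.Length[c(1, 2)] <- NA
print(summary(iris_na$Sepal.Length))

# A factor is summarized as counts per category (50 of each species)
print(summary(iris$Species))
```

The third call adds an NA's entry to the output, which is how missing values show up per variable.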
Datasets: iris and capitalization rules
The iris dataset is built into R, so no import steps are needed. Because R is case-sensitive, the dataset name in the command must match the object's name exactly: for iris, the command is summary(iris), all lowercase. If a dataset name is not all lowercase (e.g., a dataset called HOUSEDATA in uppercase), you must match its capitalization precisely, as in summary(HOUSEDATA).
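The capitalization rule can be checked with a short sketch. The misspelled name IRIS is used here on purpose, and tryCatch() captures the resulting error so the script keeps running.

```r
# Correct: the function name and the dataset name are both lowercase
result <- summary(iris)

# Wrong capitalization: R has no object called IRIS, so this call fails.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  summary(IRIS),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message names the object R could not find, which is usually the quickest clue that the capitalization is off.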
Working with non-preloaded datasets: importing from Excel
If you want to use a dataset that is not pre-loaded in R, such as data from Excel, you must first import it into your R session (in RStudio or Posit Cloud). The typical steps are:
- Open the Environment tab.
- Click Import Dataset.
- Choose From Excel.
- Use Browse to locate the file in your data folder and select the correct file.
- Click Open, then Import to bring the dataset into the workspace.
The video shows a common pitfall: accidentally selecting the wrong dataset (e.g., clicking ice cream instead of HOUSEDATA) and then correcting the choice before importing.
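The point-and-click steps above can also be written as code. This is a sketch only: it assumes the readxl package is installed, and the path data/HOUSEDATA.xlsx is a hypothetical location, not one given in the video. The check on the path guards against the wrong-file pitfall by failing with a message rather than importing something else.

```r
# Hypothetical path: adjust to wherever your Excel file actually lives
path <- "data/HOUSEDATA.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  # Read the first sheet into an object whose name matches the video's
  HOUSEDATA <- readxl::read_excel(path)
  print(summary(HOUSEDATA))
} else {
  message("readxl is not installed or no file was found at: ", path)
}
```

After a successful import, HOUSEDATA should appear in the Environment pane, just as it does with the Import Dataset dialog.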
Example: The HOUSEDATA dataset and its summary
After importing the HOUSEDATA dataset, the user runs the summary command with the correct capitalization: summary(HOUSEDATA). The resulting statistics illustrate several key points:
- House sizes range from 1000 to 3500 square feet, with an average size of 1947 square feet.
In LaTeX notation, the size variable has:
- minimum: $\text{min}(\text{size}) = 1000$
- first quartile: $Q_1(\text{size})$
- median: $\text{median}(\text{size})$
- mean: $\bar{\text{size}} = 1947$
- third quartile: $Q_3(\text{size})$
- maximum: $\text{max}(\text{size}) = 3500$
- Prices range from $7.775 \times 10^{4}$ dollars to $2.845 \times 10^{5}$ dollars (i.e., from 77.75 thousand to 284.5 thousand dollars).
- It is important to interpret the summary statistics only for quantitative variables. In HOUSEDATA, the ID variable is numeric, but its minimum, quartiles, median, mean, and maximum do not provide meaningful population insights.
- The agency variable is also numeric but represents categories (1, 2, 3, 4). An average of these category codes, such as $\frac{1+2+3+4}{4} = 2.5$ if each code appeared equally often, does not convey anything meaningful about the population or the underlying categories.
Therefore, you should interpret the summary statistics for quantitative variables only, and treat categorical variables with their category counts or levels instead of numerical averages.
This example demonstrates why it’s crucial to differentiate quantitative versus categorical variables when interpreting summary outputs:
- For numeric variables, the min, Q1, median, mean, Q3, and max summarize distributional properties.
- For categorical variables, counts per category are the relevant summaries. Numeric encodings (like 1–4) carry no numeric meaning, and summary() reports counts for them only when they are stored as factors.
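To illustrate the difference, here is a small sketch using a made-up data frame. The houses object, its column names, and its values are invented for this example and are not the real HOUSEDATA contents.

```r
# Made-up rows for illustration only; not the real HOUSEDATA values
houses <- data.frame(
  ID     = 1:6,
  size   = c(1200, 1500, 1800, 2100, 2600, 3000),
  agency = c(1, 2, 2, 3, 4, 4)
)

# Stored as numbers, agency gets a mean and quartiles that mean nothing
print(summary(houses$agency))

# Stored as a factor, agency is summarized as counts per category
houses$agency <- factor(houses$agency)
print(summary(houses$agency))
```

Converting a coded column with factor() is one way to make summary() report category counts instead of averages.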
Practical tips and caveats
- Always ensure the dataset name in the summary() command exactly matches the object in your environment, including capitalization, because R is case-sensitive.
- When importing from Excel, verify you have the correct file selected before importing, and check that the imported object appears in the Environment pane.
- Distinguish numeric (quantitative) variables from categorical ones before interpreting summary output; numeric summary statistics (min, max, quartiles, mean, median) have limited interpretive value for coded categories.
- Missing values are reported per variable in the summary output, which helps in assessing data completeness.
- If you are using a pre-loaded dataset like iris, you can rely on the built-in structure, which typically includes numeric variables (e.g., Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and a categorical Species variable with counts per category in the summary.
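One way to check variable types before reading the summary output, sketched here with iris, is to inspect the structure with str() and list each column's class. The name types is just an illustrative variable.

```r
# Show each column's type and its first few values
str(iris)

# Class of every column: four numeric measurements and one factor
types <- sapply(iris, class)
print(types)
```

Columns reported as numeric get the six-number summary, while the factor column gets counts per category.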
Quick reference: key statistics and notation
- For a numeric vector x, the summary includes:
  - minimum: $\text{min}(x)$
  - first quartile: $Q_1(x)$ (25th percentile)
  - median: $\text{median}(x)$ (50th percentile)
  - mean: $\bar{x}$
  - third quartile: $Q_3(x)$ (75th percentile)
  - maximum: $\text{max}(x)$
- Missing values per variable: count of NA values (if any).
- Categorical variables display counts per category when stored as factors; when stored as character text, they show only length, class, and mode.
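Each entry in the quick reference corresponds to a base R function. The sketch below computes them one by one for a single iris column; the names x and stats are illustrative choices.

```r
x <- iris$Petal.Length

# The six statistics that summary() reports for a numeric vector
stats <- c(
  min    = min(x),
  q1     = quantile(x, 0.25, names = FALSE),
  median = median(x),
  mean   = mean(x),
  q3     = quantile(x, 0.75, names = FALSE),
  max    = max(x)
)
print(stats)

# Number of missing values in the vector (summary() lists these as NA's)
print(sum(is.na(x)))

# Compare with the built-in summary
print(summary(x))
```

If the vector contained missing values, each of these functions would need na.rm = TRUE to return a number, whereas summary() sets the missing values aside and counts them separately.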
Summary: practical takeaways for exam-ready understanding
- The summary() command provides a concise snapshot of a dataset’s distribution for numeric variables, including central tendency (mean, median) and spread (min, max, quartiles).
- Always check the dataset’s structure and the variable types before interpreting the results; numerical encodings for categories do not carry inherent numeric meaning.
- When importing data from Excel or other sources, ensure correct file selection and correct capitalization of dataset names to avoid errors and misinterpretation.