Stats

STAT 118 Study Set

Key Terms and Concepts

1. Definitions of Statistics

  • Statistics: The science of collecting, organizing, and interpreting data.

  • Individuals: The objects on which data are collected (e.g., students, states, hospitals).

  • Variables: Characteristics recorded about individuals.

2. Types of Variables

  • Quantitative Variables: Numeric values with meaningful operations (e.g., height, weight).

  • Categorical Variables: Groups or categories (e.g., gender, college type).

  • Identifier Variables: Unique values assigned to individuals (e.g., ID numbers).

3. Data Visualization

  • Bar Charts & Pie Charts: Represent categorical data.

  • Histograms: Display quantitative data distributions.

  • Boxplots: Compare distributions and identify outliers.

  • Dotplots & Density Plots: Represent distributions and trends.

4. Measures of Center

  • Mean (x̄): Sum of all values divided by the number of values.

  • Median (m): The middle value when data is ordered.

5. Measures of Spread

  • Range: Difference between the largest and smallest values.

  • Interquartile Range (IQR): The difference between Q3 (75th percentile) and Q1 (25th percentile).

  • Standard Deviation (S): Measures variation around the mean.

6. Standardization

  • Z-Score: $Z = \frac{X - \mu}{\sigma}$ (measures how many standard deviations a value lies from the mean)

  • 68-95-99.7 Rule: In a Normal distribution, about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean.
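
As a small illustration of the z-score formula, the sketch below standardizes a made-up vector of exam scores in base R. The name `scores` and its values are invented for this example, and the sample mean and sample SD stand in for μ and σ.

```r
# Hypothetical exam scores, invented for illustration
scores <- c(62, 70, 75, 81, 92)

# Sample mean and sample standard deviation stand in for mu and sigma
x_bar <- mean(scores)
s <- sd(scores)

# z-score: how many standard deviations each value lies from the mean
z <- (scores - x_bar) / s
print(round(z, 2))
```

Standardized values always have mean 0 and standard deviation 1, which is a quick way to check the calculation.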

7. Relationship Between Variables

  • Explanatory Variable: The variable suspected to influence another.

  • Response Variable: The variable that is measured as an outcome.

8. Simpson’s Paradox

  • Occurs when a relationship between two variables that holds within each group reverses once the groups are combined, because of a lurking variable.


Important R Functions

1. Data Visualization

# Bar Chart
bargraph(~Variable, data = Dataset)

# Histogram
histogram(~Variable, data = Dataset)

# Boxplot
bwplot(Variable ~ Category, data = Dataset)

2. Summary Statistics

# Mean and Median
mean(Dataset$Variable)
median(Dataset$Variable)

# Standard Deviation
sd(Dataset$Variable)

# Interquartile Range
IQR(Dataset$Variable)

3. Normal Distribution

# Probability below a value
pnorm(value, mean = mu, sd = sigma)

# Probability above a value
1 - pnorm(value, mean = mu, sd = sigma)

# Finding percentiles
qnorm(percentile, mean = mu, sd = sigma)

STAT 118 Exam Study Sheet (Chapters 1-5)

Key Terms and Concepts

1. Types of Variables

  • Categorical (Qualitative) Variables: Describe qualities or categories (e.g., gender, college type).

  • Quantitative Variables: Numeric values with meaningful operations (e.g., height, weight).

  • Nominal Variables: Categories without a meaningful order (e.g., colors, names).

  • Ordinal Variables: Categories with a meaningful order but no consistent difference (e.g., ranking, education level).

  • Interval Variables: Ordered with meaningful differences between values (e.g., temperature, income).

2. Proportion vs. Percent

  • Proportion: A fraction representing part of a whole (e.g., 0.25 or 1/4).

  • Percent: A proportion multiplied by 100 (e.g., 0.25 = 25%).

3. Measures of Center & Spread

  • Mean (x̄): Average of data.

  • Median (m): Middle value when ordered.

  • Range: Max - Min.

  • Interquartile Range (IQR): Q3 - Q1 (middle 50% of data).

  • Standard Deviation (S): Measures spread around the mean.

  • Issues with SD for Outliers: SD is sensitive to outliers; extreme values heavily influence it.
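
To see why the SD is called sensitive to outliers, the following sketch compares the SD and the IQR before and after adding one extreme value. The data are invented for illustration; only base R functions are used.

```r
# Hypothetical data, invented for illustration
values <- c(10, 12, 13, 14, 15, 16, 18)
with_outlier <- c(values, 90)   # same data plus one extreme value

# Standard deviation before and after adding the outlier
sd_before <- sd(values)
sd_after <- sd(with_outlier)

# Interquartile range before and after adding the outlier
iqr_before <- IQR(values)
iqr_after <- IQR(with_outlier)

print(c(sd_before = sd_before, sd_after = sd_after))
print(c(iqr_before = iqr_before, iqr_after = iqr_after))
```

In this example the SD grows many times over while the IQR barely moves, which is why the median and IQR are usually preferred for skewed data or data with outliers.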

4. Histograms & Distribution Shapes

  • Symmetric (Bell-shaped): Mean ≈ Median.

  • Right-skewed: Long tail on the right; typically Mean > Median.

  • Left-skewed: Long tail on the left; typically Mean < Median.

  • Uniform: Equal frequency across bins.

  • Bimodal: Two peaks.
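
The link between skew and the mean/median comparison can be checked on small made-up samples. Both vectors below are invented for illustration; the long tail pulls the mean toward it while the median stays near the bulk of the data.

```r
# Hypothetical samples, invented for illustration
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)          # long tail to the right
left_skewed  <- c(1, 16, 17, 17, 18, 18, 18, 19, 19, 20)  # long tail to the left

# Compare mean and median for each shape
print(c(mean = mean(right_skewed), median = median(right_skewed)))
print(c(mean = mean(left_skewed), median = median(left_skewed)))
```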

5. Standardizing, Shifting & Scaling

  • Standardizing (Z-score): $Z = \frac{X - \mu}{\sigma}$ (tells how many SDs a value is from the mean)

  • Shifting: Adding or subtracting a constant moves measures of center (mean, median) by that constant but leaves spread (SD, IQR, range) unchanged.

  • Scaling: Multiplying or dividing by a constant changes both center and spread by that same factor.
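
A quick numerical check of the shifting and scaling rules, using an invented vector `x` and arbitrary constants (5 and 3) chosen only for this sketch:

```r
# Hypothetical data, invented for illustration
x <- c(2, 4, 6, 8, 10)

shifted <- x + 5   # shifting: add a constant
scaled  <- x * 3   # scaling: multiply by a constant

# Shifting moves the mean by 5 and leaves the SD unchanged
print(c(mean_x = mean(x), mean_shifted = mean(shifted)))
print(c(sd_x = sd(x), sd_shifted = sd(shifted)))

# Scaling multiplies both the mean and the SD by 3
print(c(mean_scaled = mean(scaled), sd_scaled = sd(scaled)))
```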

6. Normal Distribution & SD Bell Curve Percents

  • 68-95-99.7 Rule:

    • 68% within 1 SD

    • 95% within 2 SDs

    • 99.7% within 3 SDs
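
The three percentages in the rule are rounded values that can be recovered from `pnorm` on the standard Normal distribution (mean 0, SD 1). The helper name `within_k_sd` is made up for this sketch.

```r
# Proportion of a Normal distribution within k SDs of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
names(coverage) <- c("1 SD", "2 SDs", "3 SDs")

# Express as percentages rounded to one decimal place
print(round(100 * coverage, 1))
```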

7. Using pnorm vs. qnorm in R

  • pnorm(x, mean, sd): Finds the probability below a value.

  • 1 - pnorm(x, mean, sd): Finds the probability above a value.

  • qnorm(percentile, mean, sd): Finds the value at a given percentile, where the percentile is entered as a proportion between 0 and 1 (e.g., 0.90 for the 90th percentile).
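
A short worked example may help keep the two functions apart. It assumes a hypothetical Normal model for heights with mean 70 and SD 3; these numbers are invented for illustration.

```r
# Assumed Normal model (hypothetical values)
mu <- 70
sigma <- 3

# pnorm: value in, probability out
p_below <- pnorm(73, mean = mu, sd = sigma)       # P(X < 73)
p_above <- 1 - pnorm(73, mean = mu, sd = sigma)   # P(X > 73)

# qnorm: probability in, value out (90th percentile)
cutoff <- qnorm(0.90, mean = mu, sd = sigma)

print(c(p_below = p_below, p_above = p_above, cutoff = cutoff))
```

The two functions are inverses: feeding `cutoff` back into `pnorm` returns 0.90.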

8. Correlation (R-Value)

  • Measures the strength and direction of a linear relationship between two quantitative variables.

  • Ranges from -1 to 1:

    • R = 1: Perfect positive correlation.

    • R = -1: Perfect negative correlation.

    • R = 0: No linear correlation.
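
The correlation can be computed with base R's `cor()`. The vectors `hours` and `score` below are invented for illustration; the last two lines show the extreme cases of a perfect positive and a perfect negative linear relationship.

```r
# Hypothetical data, invented for illustration
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 60, 61, 70, 78)

# Correlation between two quantitative variables
r <- cor(hours, score)
print(round(r, 3))

# Perfect linear relationships give r = 1 and r = -1
r_pos <- cor(hours, 2 * hours + 1)
r_neg <- cor(hours, -hours)
print(c(r_pos = r_pos, r_neg = r_neg))
```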

9. Simpson’s Paradox

  • A trend that appears within each group reverses when the groups are combined, because of a lurking variable.

  • Example: A hospital appears to have a higher death rate overall, but when split by patient condition, it actually has a lower death rate in each category.
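
The hospital example can be reproduced with made-up counts. All numbers below are invented for illustration: Hospital A treats mostly patients in poor condition, which is the lurking variable that drives the reversal.

```r
# Hypothetical counts, invented for illustration
hospital  <- c("A", "A", "B", "B")
condition <- c("good", "poor", "good", "poor")
patients  <- c(100, 900, 900, 100)
deaths    <- c(1, 36, 18, 6)

# Death rate within each hospital/condition group
rate_by_group <- deaths / patients
names(rate_by_group) <- paste(hospital, condition)
print(rate_by_group)

# Overall death rate per hospital, ignoring patient condition
overall <- sapply(split(seq_along(hospital), hospital),
                  function(i) sum(deaths[i]) / sum(patients[i]))
print(overall)
```

With these counts Hospital A has the lower death rate in both the good and poor condition groups, yet the higher death rate overall.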

Key R Commands

  • Summary Statistics:

    mean(dataset$variable)
    median(dataset$variable)
    sd(dataset$variable)
    IQR(dataset$variable)
  • Histograms & Boxplots:

    histogram(~ variable, data = dataset)
    bwplot(variable ~ category, data = dataset)
  • Normal Distribution Calculations:

    pnorm(x, mean, sd)   # Probability below x
    1 - pnorm(x, mean, sd) # Probability above x
    qnorm(percentile, mean, sd) # Value at given percentile


      Statistics and Data Analysis

      1. Mean (Average): The sum of all values in a dataset divided by the number of values.

      2. Median: The middle value in an ordered dataset; if even, the average of the two middle values.

      3. Mode: The most frequently occurring value(s) in a dataset.

      4. Range: The difference between the maximum and minimum values in a dataset.

      5. Interquartile Range (IQR): The range of the middle 50% of data, calculated as Q3 - Q1.

      6. Quartiles: Values that divide a dataset into four equal parts:

        • Q1 (First Quartile): 25th percentile

        • Q2 (Median): 50th percentile

        • Q3 (Third Quartile): 75th percentile

      7. Standard Deviation (SD): A measure of how spread out the data is around the mean. A higher SD indicates more variability.

      8. Five-Number Summary: A set of five values (Min, Q1, Median, Q3, Max) that summarize a dataset.


      Outliers Detection

      1. Outliers: Data points that are significantly higher or lower than the rest of the dataset.

      2. Lower Fence: $Q1 - 1.5 \times IQR$, used to detect low-end outliers.

      3. Upper Fence: $Q3 + 1.5 \times IQR$, used to detect high-end outliers.
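
The fence rules can be applied in a few lines of base R. The data are invented for illustration, and the quartiles come from `quantile()` with its default method, which may differ slightly from hand calculations or from other software.

```r
# Hypothetical data, invented for illustration
x <- c(3, 7, 8, 9, 10, 11, 12, 13, 30)

# Quartiles and IQR (default quantile method)
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences for flagging outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)
```

Here only the value 30 falls outside the fences; the value 3 looks low but still lies above the lower fence.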


      Probability and Percentage Calculations

      1. Proportion: A fraction representing a part of a whole, often converted into a percentage.

      2. Percentage: A way to express a proportion out of 100, calculated as $\frac{\text{part}}{\text{whole}} \times 100$.

      3. Conditional Probability: The likelihood of an event occurring given that another event has already occurred (e.g., percentage of Obama supporters who were male).
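
Conditional percentages depend on which total goes in the denominator. The sketch below uses a two-way table of made-up counts (not real polling data) to contrast the share of Obama supporters who were male with the share of males who supported Obama.

```r
# Hypothetical counts, invented for illustration (not real poll results)
counts <- matrix(c(45, 55,
                   60, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("male", "female"),
                                 vote = c("Obama", "Other")))

# Condition on vote: each column sums to 1
by_vote <- prop.table(counts, margin = 2)
p_male_given_obama <- by_vote["male", "Obama"]

# Condition on gender: each row sums to 1
by_gender <- prop.table(counts, margin = 1)
p_obama_given_male <- by_gender["male", "Obama"]

print(c(male_given_obama = p_male_given_obama,
        obama_given_male = p_obama_given_male))
```

The two answers differ because the first divides by the number of Obama supporters and the second divides by the number of males.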


      Categorical Data Analysis

      1. Frequency Table: A table that lists the number of times different categories occur in a dataset.

      2. Contingency Table: A table that shows the frequency distribution of variables to examine relationships between them.

      3. Gender Gap in Voting: A phenomenon where voting preferences differ significantly between males and females.


      Comparing Distributions

      1. Boxplot: A graphical representation of the five-number summary, useful for comparing distributions.

      2. Histogram: A graph whose adjacent bars show how many values of a quantitative variable fall in each interval (bin).

      3. Symmetric Distribution: A dataset where the left and right sides of the histogram are roughly mirror images.

      4. Skewed Distribution:

        • Right-Skewed (Positive Skew): Tail is longer on the right side.

        • Left-Skewed (Negative Skew): Tail is longer on the left side.

      5. Spread/Variability: The extent to which data values differ, measured by range, IQR, and standard deviation.


      Data Visualization and Alternative Representations

      1. Bar Chart: A chart that uses bars to represent categorical data.

      2. Scatterplot: A graph of plotted points that show the relationship between two variables.

      3. Alternative Graphical Representations: Other ways to display data, such as side-by-side boxplots for comparing distributions.


      Software-Specific Knowledge (R Programming Basics)

      1. favstats(): An R function (from the mosaic package) that provides summary statistics (min, Q1, median, Q3, max, mean, SD, n, and the number of missing values) for a variable.

      2. histogram(): An R function that generates a histogram to visualize numerical data distributions.

      3. tally(): An R function that creates frequency tables for categorical data.

      4. bwplot(): An R function that generates boxplots to compare distributions of a variable across different categories.
