CMSC 320 Exam 1 Content


84 Terms

1

Hypothesis Testing

A statistical method to make informed decisions about a population based on sample data.

2

null hypothesis (H₀)

A statement of no effect or no difference, assumed true until evidence suggests otherwise.

3

alternative hypothesis (H₁ or Hₐ)

A statement that contradicts the null hypothesis; it is supported when the evidence against H₀ is strong.

4

What does statistical significance mean?

The observed result is unlikely under the null hypothesis, often determined by a p-value less than α.

5

p-value

The probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.

6

significance level (α)

A threshold for rejecting H₀, typically set at 0.05 or 0.01.

7

Type I Error

Incorrectly rejecting a true null hypothesis (false positive).

8

Type II Error

Failing to reject a false null hypothesis (false negative).

9

random sampling

Every member of the population has an equal chance of selection.

10

stratified sampling

Divide the population into subgroups and randomly sample from each subgroup.
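
A minimal sketch of stratified sampling, assuming the population is a list of records and that each stratum is sampled at the same fraction. The function name `stratified_sample`, the `students` data, and the 10% fraction are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Randomly sample the same fraction from every subgroup (stratum)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # group records by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 freshmen and 40 seniors.
students = [{"id": i, "year": "freshman" if i < 60 else "senior"} for i in range(100)]
picked = stratified_sample(students, key=lambda s: s["year"], fraction=0.1)
```

Because each subgroup is sampled separately, the sample keeps the 60/40 split of the population, which a simple random sample of 10 would not guarantee.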

11

cluster sampling

Divide the population into groups (clusters), randomly select some clusters, and include all members of selected clusters.

12

systematic sampling

Select every kth individual from a list, starting at a random point.
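
Systematic sampling can be sketched with list slicing. This is an illustration under the assumption that the population is available as an ordered list; `systematic_sample` and `roster` are hypothetical names.

```python
import random

def systematic_sample(items, k, seed=None):
    """Pick a random start in [0, k) and then take every kth item."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting point within the first k items
    return items[start::k]

roster = list(range(100))  # hypothetical list of 100 individuals
sample = systematic_sample(roster, k=10, seed=1)
```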

13

convenience sampling

Select individuals who are easiest to reach; may introduce bias.

14

critical region

The range of values where the null hypothesis is rejected.

15

critical value

The boundary that separates the critical region from the rest of the distribution.

16

How do you decide whether to reject H₀ using a test statistic?

Compare the test statistic to the critical value or the p-value to α.
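
As an illustration, the two decision rules can be checked side by side for a two-tailed one-sample z-test. This is a sketch that assumes the population standard deviation is known; the function name `z_test` and the numbers passed to it are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test; returns (z, p_value, reject_h0)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # test statistic
    std_normal = NormalDist()                     # mean 0, standard deviation 1
    p_value = 2 * (1 - std_normal.cdf(abs(z)))    # area in both tails
    critical = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    # Rule 1: reject H0 when |z| exceeds the critical value.
    # Rule 2: reject H0 when p_value < alpha. Both give the same decision.
    reject_h0 = abs(z) > critical
    return z, p_value, reject_h0

# Hypothetical data: sample mean 52 from n = 36, testing H0: mu = 50 with sigma = 6.
z, p, reject = z_test(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
```

Here z is 2.0 and the p-value is roughly 0.046, so H₀ is rejected at α = 0.05 under either rule.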

17

one-tailed test

A test that checks for an effect in only one direction (e.g., μ > μ₀ or μ < μ₀).

18

two-tailed test

A test that checks for an effect in both directions (μ ≠ μ₀).

19

When do you use a Z-test?

When the sample size is large (n ≥ 30) and the population standard deviation is known.

20

When do you use a T-test?

When the population standard deviation is unknown, especially when the sample size is small (n < 30).

21

paired T-test

A test for comparing means of the same group at two points in time or under two conditions.
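
A paired T-test reduces to a one-sample test on the per-subject differences. The sketch below computes only the test statistic and degrees of freedom; the helper name `paired_t_statistic` and the before/after scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired measurements."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]  # one difference per subject
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))      # stdev is the sample std dev
    return t, n - 1

# Hypothetical scores for the same five students before and after tutoring.
before = [72, 65, 80, 70, 68]
after = [75, 70, 82, 74, 69]
t, df = paired_t_statistic(before, after)
```

With these numbers t is about 4.24 on 4 degrees of freedom, which exceeds the two-tailed critical value of about 2.776 at α = 0.05.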

22

What is a chi-square test used for?

To test whether two categorical variables are associated (test of independence) or whether observed counts match an expected distribution (goodness of fit).

23

What is ANOVA used for?

To compare the means of three or more groups for significant differences.

24

What are post hoc tests and when are they used?

Used after ANOVA to determine which specific group means differ significantly.

25

Probability

A measure between 0 and 1 that describes the likelihood of an event occurring.

26

What is Bayes' Rule used for?

To calculate the probability of a hypothesis based on prior knowledge and new evidence.
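
A small worked example may help here. The sketch applies Bayes' Rule to a diagnostic test; the function name `posterior` and the prior, sensitivity, and false positive rate are hypothetical values picked for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(H | E) via Bayes' Rule, where E is a positive test result."""
    # P(E) = P(E | H) P(H) + P(E | not H) P(not H)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical: 1% of people have the condition, the test catches 95% of
# true cases, and 5% of healthy people test positive anyway.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
```

Even with a fairly accurate test, the posterior is only about 0.16 because the prior is so low.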

27

Conditional probability

The probability of an event occurring given that another event has already occurred.

28

Conditional Independence

When the occurrence of one event does not affect the probability of another, given a third event.

29

Law of Total Probability

A formula that finds the total probability of an event by summing over all the mutually exclusive ways it can happen.
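
Written out, assuming the events B₁, …, Bₙ are mutually exclusive and together cover every possible outcome (a partition of the sample space), the law reads:

```latex
P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
```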

30

Expected Value

The long-run average value of a random variable over many repetitions; equivalently, the probability-weighted mean of its possible outcomes.
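
For a discrete random variable, the expected value is the sum of each outcome times its probability. A minimal sketch using a fair six-sided die (the `expected_value` helper and the `die` dictionary are illustrative names):

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of outcome * probability for a {outcome: probability} mapping."""
    if sum(distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())

# Fair die: each face 1..6 has probability 1/6 (exact fractions avoid rounding).
die = {face: Fraction(1, 6) for face in range(1, 7)}
ev = expected_value(die)
```

The result is 7/2, or 3.5, even though no single roll can ever show 3.5.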

31

Uniform Distribution

A distribution where all outcomes are equally likely.

32

normal (Gaussian) distribution

A bell-shaped, symmetric distribution defined by a mean (μ) and standard deviation (σ).

33

What does the standard deviation (σ) control in a normal distribution?

It controls the spread; a smaller σ means a narrower peak, and a larger σ means a wider spread.

34

Z-score

A standardized score that tells how many standard deviations a value is from the mean.
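
As a formula, with x the observed value, μ the mean, and σ the standard deviation (the numbers in the second line are a hypothetical example):

```latex
z = \frac{x - \mu}{\sigma}
\qquad \text{e.g. } x = 85,\ \mu = 70,\ \sigma = 10 \ \Rightarrow\ z = \frac{85 - 70}{10} = 1.5
```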

35

Central Limit Theorem (CLT)

The sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the shape of the original population distribution (provided it has finite variance).
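
The theorem can be checked with a quick simulation. This sketch draws from an exponential distribution (strongly skewed, with mean 1 and standard deviation 1) and looks at the distribution of sample means; the choice of distribution, sample size, trial count, and the name `sample_means` are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def sample_means(n, trials, seed=0):
    """Means of `trials` independent samples of size n from Exponential(1)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

means = sample_means(n=50, trials=2000)
center = mean(means)   # should be close to the population mean, 1
spread = stdev(means)  # should be close to sigma / sqrt(n) = 1 / sqrt(50)
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are heavily right-skewed.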

36

Three main measures of central tendency

Mean, median, and mode.

37

When is the geometric mean preferred over the arithmetic mean?

For data involving growth rates, ratios, or percentages.
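
A short illustration of why: growth factors multiply, so the arithmetic mean can mislead. The numbers below are a hypothetical investment that gains 50% one year and loses 50% the next.

```python
from statistics import geometric_mean

growth_factors = [1.5, 0.5]  # +50% then -50%
arithmetic = sum(growth_factors) / len(growth_factors)  # suggests no change
avg_factor = geometric_mean(growth_factors)             # true per-year factor
final_value = 100 * growth_factors[0] * growth_factors[1]  # starting from 100
```

The arithmetic mean of 1.0 implies the investment broke even, but it actually ended at 75. The geometric mean (about 0.866 per year) reproduces that outcome when compounded over two years.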

38

Weighted Average

An average where each value contributes according to its importance or frequency.
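
A minimal sketch of the calculation, using hypothetical course grades and weights (`weighted_average` is an illustrative helper, not a library function):

```python
def weighted_average(values, weights):
    """Sum of value * weight divided by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total

# Hypothetical: exam 90 (50%), homework 80 (30%), quizzes 70 (20%).
grade = weighted_average([90, 80, 70], [0.5, 0.3, 0.2])
```

The weighted result is 83, compared with an unweighted mean of 80.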

39

two common measures of variability

Variance and standard deviation.

40

How do the mean and median behave in skewed data?

The mean is pulled toward outliers; the median is more robust.

41

What does standard deviation tell us?

How spread out the values are around the mean; higher values indicate greater spread.

42

Experimental design

The process of planning, conducting, and analyzing experiments to test a hypothesis and ensure reliable, unbiased conclusions.

43

Main steps of experimental design

  1. Define the problem

  2. Identify variables and population/sample

  3. Formulate a hypothesis

  4. Control for confounding variables

  5. Choose data collection method

  6. Analyze and conclude

44

Independent variable (IV)

The variable that is manipulated or changed in an experiment to observe its effect.

45

Dependent variable (DV)

The outcome that is measured in an experiment; it depends on the IV.

46

Population vs. sample

  • Population: Entire group of interest

  • Sample: Subset of the population used for study

47

Hypothesis

A testable explanation predicting the relationship between variables (e.g., "If X, then Y").

48

Optimization criterion

A goal or objective (like maximizing CTR or accuracy) used to evaluate outcomes.

49

Confounding variable

An external factor that may influence the DV and distort results if not controlled.

50

How can you control confounding variables?

  • Hold variables constant

  • Randomization (RCTs)

  • Replication

  • Stratified Randomization

  • Block Design (Matched Pair)

51

Randomization

Randomly assigning participants to groups to minimize systematic bias.
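
One simple way to do this is to shuffle the participant list and split it in half. The sketch below assumes two groups of (nearly) equal size; `randomize` and the participant IDs are hypothetical.

```python
import random

def randomize(participants, seed=None):
    """Shuffle a copy of the list and split it into treatment and control."""
    rng = random.Random(seed)
    shuffled = participants[:]  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

groups = randomize([f"P{i:02d}" for i in range(20)], seed=42)
```

Because assignment depends only on the shuffle, no participant characteristic can systematically steer someone into one group.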

52

Replication

Repeating the experiment to confirm reliability and reduce the effect of anomalies.

53

Stratified randomization

Grouping subjects by confounders (e.g., prior knowledge), then randomizing within each group.

54

Block design (matched pair)

Pairing similar individuals and assigning one to control and one to treatment to isolate effects.

55

Four main methods of data collection

  1. Observational studies

  2. Surveys

  3. Experiments

  4. Simulations

56

Types of observational studies

  • Cross-sectional

  • Retrospective (case-control)

  • Prospective (cohort)

57

Placebo effect

Improvement due to belief in treatment, not the treatment itself.

58

Blinding

A method to reduce bias where participants and/or researchers don’t know group assignments.

59

Single vs double blinding

  • Single-blind: Either participants or researchers are blind

  • Double-blind: Both are blind

60

When would you use a simulation for data collection?

When real-world testing is too expensive, dangerous, or impractical.

61

Fundamental rule of data collection

Your data must be representative of the population you want to study.

62

Data cleaning

The process of removing or correcting inaccurate, incomplete, or irrelevant data from a dataset.

63

Why is data cleaning important?

It ensures data quality, improves analysis reliability, and prepares data for modeling or decision-making.

64

What are duplicated records and how are they handled?

Duplicate rows that may be exact or slightly different. Use df.drop_duplicates() for exact matches; others require manual review.

65

Evolving labeling schemes

Changes in data categories or labels over time (e.g., "Good" becomes "Very Good"). Handle by remapping or splitting data by time periods.

66

Outlier

A value significantly different from the rest, typically more than 2 standard deviations above or below the mean (some conventions use 3).

67

How is a z-score used in outlier detection?

It measures how far a data point is from the mean in standard deviations; extreme z-scores suggest outliers.
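
A minimal sketch of z-score outlier detection, assuming a cutoff of 2 standard deviations and using the population standard deviation of the data itself. The function name `zscore_outliers` and the `readings` list are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the data
    if sigma == 0:
        return []           # all values identical, so nothing stands out
    return [x for x in values if abs((x - mu) / sigma) > threshold]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
outliers = zscore_outliers(readings)
```

Only the reading of 50 is flagged; its z-score is about 2.99, while every other reading is within half a standard deviation of the mean. Note that the outlier itself inflates the mean and standard deviation, which is one reason the choice of cutoff matters.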

68

Should outliers always be removed?

No. Remove them only if they distort the analysis and are not meaningful for the problem at hand.

69

What are the types of missing data?

MCAR (Missing Completely at Random), MAR (Missing at Random), and MNAR (Missing Not at Random).

70

MCAR

Missingness unrelated to any observed or unobserved data; purely random.

71

MAR

Missingness related to other observed variables, not the missing values themselves.

72

MNAR

Missingness is related to the missing values themselves; most difficult to address.

73

How do you handle MCAR data?

Use listwise deletion, pairwise deletion, or simple imputations like mean, median, or mode.

74

How do you handle MAR data?

Use regression, KNN imputation, multiple imputation, or ML models trained on observed variables.

75

How do you handle MNAR data?

Use sensitivity analysis, pattern mixture models, or domain knowledge to impute or analyze separately.

76

Listwise deletion

Remove entire rows with any missing values.

77

pairwise deletion

Use all available data for each analysis; exclude a row only when it is missing a value for a variable that analysis uses.
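
The difference between listwise and pairwise deletion is easiest to see on a small table. In this sketch missing values are represented as `None`; the `rows` data and both helper functions are hypothetical.

```python
rows = [
    {"age": 21,   "gpa": 3.4,  "hours": 10},
    {"age": None, "gpa": 3.8,  "hours": 12},
    {"age": 23,   "gpa": None, "hours": 8},
    {"age": 22,   "gpa": 3.1,  "hours": None},
]

def listwise(rows):
    """Keep only rows with no missing values in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, columns):
    """Keep rows that are complete for just the columns being analyzed."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

complete = listwise(rows)                # only the first row survives
age_gpa = pairwise(rows, ["age", "gpa"])  # first and last rows survive
```

Listwise deletion leaves one usable row, while an analysis of age against GPA under pairwise deletion can use two.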

78

When is mean imputation appropriate?

For numeric data that is not skewed and has <5% missingness.
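
A minimal sketch of mean imputation on a single numeric column, with missing entries represented as `None`. The helper name `impute_mean` and the `heights` values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

heights = [170, None, 165, 175, None, 170]
filled = impute_mean(heights)
```

Both gaps are filled with 170, the mean of the four observed values. The column mean is preserved, but the variance shrinks, which is one reason to limit this approach to small amounts of missing data.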

79

When is mode imputation used?

For categorical data or when a clear most frequent value exists.

80

hot-deck imputation

Replace missing values using values from similar cases within the dataset.

81

multiple imputation

Create several plausible versions of the dataset with different imputations, analyze separately, then combine results.

82

boundary conditions

Data constraints imposed by instruments or systems (e.g., a sensor that can't read below −10°C).

83

How can you detect incorrect data?

Look for attractors, discontinuities, extreme or impossible values, and data outside valid ranges.
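
Two of these checks can be sketched in a few lines. The example assumes that an "attractor" means a single value that shows up suspiciously often (for instance a sensor floor or a default entry); that reading, the 30% cutoff, the valid range, and the name `flag_suspect` are all assumptions for illustration.

```python
from collections import Counter

def flag_suspect(values, low, high, attractor_share=0.3):
    """Return (out_of_range_values, attractor_or_None)."""
    # Values outside the physically valid range are flagged outright.
    out_of_range = [v for v in values if not (low <= v <= high)]
    # A single value making up a large share of the data may be an attractor.
    value, count = Counter(values).most_common(1)[0]
    attractor = value if count / len(values) >= attractor_share else None
    return out_of_range, attractor

# Hypothetical temperature readings in °C from a sensor with a -10 floor.
temps = [21.5, 22.0, -10.0, -10.0, -10.0, 23.1, 400.0, 22.4, -10.0, 21.9]
bad, attractor = flag_suspect(temps, low=-10.0, high=60.0)
```

The reading of 400.0 is flagged as impossible, and −10.0 is flagged as a likely attractor because it accounts for 40% of the readings, which suggests the sensor was clipping at its lower limit.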

84

How do instrument errors affect data?

They can introduce incorrect measurements; correct them by comparing against known-good data and adjusting accordingly.