Introduction to Statistics

75 Terms

1

What is statistics?

The art and science of learning from data. It deals with the collection, classification, analysis, and interpretation of information or data.

2

3 main components of statistics

• Study design: Planning how to obtain data to help answer the questions of interest (Data Collection)

• Description: Exploring and summarizing patterns in the data (Data Analysis)

• Inference: Making decisions and predictions about a population based on data/known evidence (Inference)

3

Types of data collection

Interviews, observations, surveys, usage data, focus groups

4

Types of statistics

Descriptive: Involves organizing, displaying, and describing data (graphs, bar charts, etc.)

Inferential: Consists of methods for estimating and drawing conclusions about a population based on a sample or a subset drawn from that population (hypothesis testing, confidence intervals, etc.)

5

What is data?

Systematically recorded information such as numbers, characters, images, or labels, along with its context, which explains who, what, when, where, how, and why it is being measured.

6

Types of data

Tabular, networks/graphs, spatio-temporal, and images/video

7

Tabular Data

A collection of objects and their attributes. An object (record or observation) is what is described, and an attribute (variable or feature) is a property of that object

8

Types of variables

Numerical and categorical

9

Numerical/Quantitative variable

A variable that records measurable numerical values with units. It represents amounts or degrees of something, and operations like adding or averaging make sense.

10

Discrete (numerical) variable

A type of quantitative variable that takes on a finite or countably infinite set of values. Ex) number of students in a class.

11

Continuous (numerical) variable

A type of quantitative variable that can take on any value within a range, so its possible values cannot be counted. Ex) age, height, or weight.

12

Categorical/Qualitative variable

A variable that classifies observations into groups or categories, with each case placed into one group. Ex) favorite color, marital status, gender, area code, SSN

13

Nominal (categorical) variable

Categories that have no natural order. Ex) favorite color, soda brand, yes/no questions.

14

Ordinal (categorical) variable

Categories that follow a logical order. Ex) drink sizes (small, medium, large), Olympic ranks (gold, silver, bronze).

15

Transforming numerical into categorical

A numerical variable can be grouped into ranges and treated as categories. Ex) age reported as 18–24, 25–34, 35–44, 45+
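
As a rough sketch of this kind of grouping, the function below maps a whole-number age to one of the ranges from the card. The function name `age_group`, the sample ages, and the "under 18" label are assumptions added for illustration.

```python
def age_group(age):
    """Map a whole-number age to one of the ranges listed on the card."""
    if age < 18:
        return "under 18"  # assumption: the card's ranges start at 18
    if age <= 24:
        return "18-24"
    if age <= 34:
        return "25-34"
    if age <= 44:
        return "35-44"
    return "45+"

ages = [19, 23, 31, 40, 52, 67]          # hypothetical numerical data
groups = [age_group(a) for a in ages]    # the same data as an ordinal category
print(groups)
```

After grouping, operations such as averaging no longer apply, but the categories can be counted and shown in a bar plot.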

16

Population

The total group of individuals or objects you want to make conclusions about in a statistical study. Ex) all students at Columbia University.

17

Sample

A subset of the population used to draw conclusions, often a small fraction of the whole. Ex) 1000 randomly surveyed Columbia students.

18

Parameter

A summary value calculated from a population. Ex) the average age of all students at Columbia.

19

Statistic

A summary value calculated from a sample. Ex) the average age of students in one statistics class.

20

Why can’t we usually observe an entire population?

Studying every individual is often impractical because it takes too long, costs too much, may destroy the items being measured, or is otherwise impossible

21

A bad sample

A sample that is not representative of the population and can lead to biased results. Ex) using income data from only Manhattan households to represent the entire U.S.

22

Sampling

A method that allows researchers to study a population by investigating a subset instead of every individual. Ex) estimating MLB player salaries by surveying part of the league.

23

Probability sampling

A method where every individual in the population has a known, nonzero chance of being selected. Ex) simple random sampling of students from a complete roster.

24

Non-probability sampling

A method where some individuals have no chance of being selected, or where the chances of selection are unknown, which may lead to bias. Ex) surveying only volunteers.

25

Sampling Methods

Simple Random Sampling, Stratified Sampling, Cluster Sampling, Systematic Sampling, and Convenience Sampling

26

Simple Random Sample (SRS)

A sample where every individual has the same chance of being chosen and every possible sample of the same size has the same chance of selection. Ex) giving each individual in the population a number and drawing numbers at random.
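
The number-drawing example can be sketched with Python's standard `random` module. The population of 500 ID numbers, the sample size of 20, and the seed are assumptions chosen only for illustration.

```python
import random

# Hypothetical population: ID numbers 1..500, one per individual.
population = list(range(1, 501))

rng = random.Random(42)               # fixed seed so the draw can be repeated
sample = rng.sample(population, 20)   # 20 distinct IDs drawn without replacement
print(sorted(sample))
```

`random.sample` draws without replacement, so no individual appears twice and every set of 20 IDs is equally likely to be chosen.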

27

Stratified Sampling

The population is divided into subgroups (strata) with similar characteristics, then a random sample is taken from each subgroup. Ex) grouping participants by hair color and randomly sampling from each group.
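
A minimal sketch of the hair-color example, assuming a made-up list of 100 participants and an equal number drawn from each stratum (other allocation rules exist). The helper name `stratified_sample` is hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical participants as (id, hair_color) pairs.
colors = ["brown"] * 50 + ["black"] * 30 + ["blonde"] * 15 + ["red"] * 5
participants = list(enumerate(colors))

def stratified_sample(units, per_stratum, seed=0):
    """Draw a simple random sample of up to `per_stratum` ids from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit_id, stratum in units:
        strata[stratum].append(unit_id)      # group ids by stratum
    return {name: rng.sample(ids, min(per_stratum, len(ids)))
            for name, ids in strata.items()}

picked = stratified_sample(participants, per_stratum=3)
print(picked)
```

Sampling within every stratum guarantees that small groups, such as the five red-haired participants here, are represented.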

28

Cluster Sampling

The population is divided into clusters that each resemble the whole population, so there is more variation within the clusters than between them. A few clusters are randomly selected, and every individual in those clusters is studied. Ex) grouping participants by neighborhood, randomly selecting a few neighborhoods, and surveying everyone in them.
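
The neighborhood example might look like the sketch below. The neighborhood names, resident labels, and the helper `cluster_sample` are all invented for illustration.

```python
import random

# Hypothetical clusters: neighborhood name -> residents living there.
neighborhoods = {
    "Northside": ["N1", "N2", "N3"],
    "Riverside": ["R1", "R2"],
    "Old Town": ["O1", "O2", "O3", "O4"],
    "Hillcrest": ["H1", "H2", "H3"],
}

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick `n_clusters` clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

surveyed = cluster_sample(neighborhoods, n_clusters=2, seed=3)
print(surveyed)
```

Unlike stratified sampling, the randomness here is in which groups are chosen, not in who is chosen within a group.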

29

Systematic Sampling

Individuals are chosen at regular intervals from an ordered list, starting from a random point. Ex) picking a random starting number between 1 and n, then selecting every n-th participant after it.
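
A small sketch of the every-n-th rule, assuming a hypothetical roster of 100 students and an interval of 10. The function name `systematic_sample` is an illustrative choice.

```python
import random

def systematic_sample(ordered_list, step, seed=None):
    """Take every `step`-th item, starting at a random position among the first `step` items."""
    rng = random.Random(seed)
    start = rng.randrange(step)        # random index from 0 to step - 1
    return ordered_list[start::step]   # regular interval from that start

roster = [f"student_{i:03d}" for i in range(1, 101)]   # hypothetical ordered list
chosen = systematic_sample(roster, step=10, seed=1)
print(chosen)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the ordering of the list.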

30

Convenience Sampling

Participants are chosen based on availability or willingness, making results prone to bias. Ex) surveying students in nearby classes or friends.

31

Snowball Method

Recruiting one or a few respondents, who bring in their friends to participate, who in turn bring in their friends, and so on. This introduces bias because people usually associate with others similar to themselves, so the sample lacks the variation found in the population.

32

Sampling Frame

A complete list of all individuals or units in the population who are eligible to be selected in a sample. Ex) a university registrar’s list of enrolled students.

33

Characteristics of a Good Sampling Frame

Comprehensive, up-to-date, non-duplicated, and accessible

34

Bar Plot

A graph that shows the frequency of each category of a categorical variable using bars

35

Proportion (p-hat vs p)

The proportion of cases in a category is cases in category ÷ total cases.

The sample proportion is written as p-hat, and the population proportion as p.
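
As a quick worked example of cases in category ÷ total cases, the sketch below computes a sample proportion from eight made-up yes/no responses. The helper name `sample_proportion` is hypothetical.

```python
def sample_proportion(values, category):
    """Cases in the category divided by total cases (p-hat for a sample)."""
    return sum(1 for v in values if v == category) / len(values)

responses = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]  # hypothetical sample
p_hat = sample_proportion(responses, "yes")
print(p_hat)  # 5 of the 8 responses are "yes", so 0.625
```

The population proportion p would use the same formula applied to every individual in the population, which is usually not observable.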

36

Contingency Table

A table that shows the frequency of cases for combinations of two categorical variables. Ex) class year (rows) by early class status (columns).
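
One way to build such a table by hand is to count each (row, column) pair. The six records below are invented, and the nested-dictionary layout is just one possible representation.

```python
from collections import Counter

# Hypothetical records: (class_year, early_class_status) for six students.
records = [
    ("first-year", "early"), ("first-year", "no early"),
    ("sophomore", "early"), ("sophomore", "early"),
    ("junior", "no early"), ("first-year", "early"),
]

counts = Counter(records)                        # frequency of each combination
rows = sorted({year for year, _ in records})     # class years as rows
cols = sorted({status for _, status in records}) # early-class status as columns
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

for r in rows:
    print(r, table[r])
```

Combinations that never occur, such as juniors with an early class in this made-up data, appear with a count of 0.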

37

Segmented Bar Plot

A graph that compares groups by dividing each bar into stacked segments, one for each category of a second variable. Ex) one bar per class year, split into an early-class segment and a no-early-class segment.

38

Side-by-Side Bar Plot

A graph that compares groups by placing separate bars for each category of a second variable next to each other. Ex) separate bars for early vs no early class shown side by side within each class year.

39

Mosaic Plot

A graph that uses the area of rectangles to show relationships between two or more categorical variables. The width of each column reflects how many cases fall in that category, and the segments within a column show the breakdown by the other variable. Ex) showing how sleep duration varies by profession.

40

Dot Plot

A simple graph where each observation is shown as a dot along a number line. Ex) number of rooms in 45 homes displayed with one dot per home.

41

Histogram

A graph that groups numerical data into intervals of equal width and shows how many cases fall in each interval. Ex) living area of homes grouped into 250-sq-ft bins.
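
The counting step behind a histogram can be sketched as follows, using 250-sq-ft bins as in the example. The living-area values and the helper name `histogram_counts` are made up for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per equal-width bin; keys are the left edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

living_area = [980, 1120, 1240, 1260, 1490, 1510, 1730, 2240]  # hypothetical sq ft
bins = histogram_counts(living_area, 250)
for left_edge, count in bins.items():
    print(f"{left_edge}-{left_edge + 250}: {'#' * count}")
```

Each bin covers values from its left edge up to, but not including, the next edge. Bins with no cases (here 1750 to 2000) are simply absent from the dictionary, though a plotted histogram would show them as gaps.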

42

How to Interpret Histograms

Look at the overall shape, range, and density of the data. Higher bars show where values are more common, and smooth curves can help visualize the general distribution.

43

Symmetric/Bell-Shaped

Data are clustered in the middle, with roughly equal numbers of smaller and larger values on either side. Ex) heights of adults.

44

Right-skewed

Most values are small, with a few very large values stretching the tail to the right. Ex) household income

45

Left-skewed

Most values are large, with a few very small values stretching the tail to the left. Ex) age at retirement in some professions

46

Symmetric but Not Bell-Shaped

Values on one side of the distribution mirror those on the other, but they are not clustered in the center. Ex) a roughly flat (uniform) distribution.

47

Unimodal, Bimodal, Multimodal

A histogram with one peak is unimodal, with two peaks bimodal, and with three or more peaks multimodal.

48

Two Main Measures of Central Tendency

Mean: the numerical average.

Median: the middle value when data are arranged from smallest to largest.

Both describe the “center” of a distribution in different ways.

49

Mean

\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}=\frac{1}{n}\sum_{i=1}^{n}x_{i}

Represents the arithmetic average of all observations. Sensitive to extreme values.

50

Median

Arrange data from smallest to largest. If n is odd → take the middle value. If n is even → take the average of the two middle values. Robust to outliers and skewed data.

51

Choosing Between Mean and Median

Mean: use for symmetric data without outliers.

Median: use for skewed data or with outliers.
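
To see why the median is preferred when outliers are present, the sketch below compares both measures on a small made-up income list, before and after adding one extreme value. It uses Python's standard `statistics` module.

```python
import statistics

incomes = [40, 45, 50, 55, 60]        # hypothetical incomes, in thousands
with_outlier = incomes + [1000]       # same data plus one extreme value

print(statistics.mean(incomes), statistics.median(incomes))            # 50 and 50
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # about 208.3 and 52.5
```

A single extreme value pulls the mean from 50 to roughly 208, while the median only moves from 50 to 52.5.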

52

Variation

Reporting only the average hides how much individual values differ: two datasets can share the same mean but differ greatly in spread. A measure of variation shows how tightly or loosely the data cluster around the center, and it is crucial in most statistical analyses.

53

Range

Formula: Range = Maximum - Minimum

The range measures how spread out data are by subtracting the smallest value from the largest. The range only uses two data points and ignores all others. Because of this, it can give a misleading picture of variability when outliers are present.

54

Percentiles

Percentiles divide data into 100 equal parts. The pth percentile is the value below which p percent of the observations fall. Percentiles help describe relative position within a dataset.

55

Quartiles (25th, 50th, and 75th Percentiles)

  • 25th percentile (Q1): median of the lower half of the data

  • 50th percentile: the median of the full data set

  • 75th percentile (Q3): median of the upper half of the data

56

Interquartile Range (IQR)

IQR = Q3-Q1

The IQR measures the spread of the middle 50 percent of a dataset. The IQR reduces the influence of outliers and extreme values by focusing only on the central half of the data. It is considered a robust measure of variation.
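
A short sketch of computing Q1, Q3, and the IQR with the median-of-halves approach. It assumes the convention that the middle value is left out of both halves when n is odd; textbooks and software use slightly different conventions, so results may differ a little elsewhere. The data and the helper name `quartiles` are illustrative.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]               # values below the middle
    upper = data[half + (n % 2):]     # values above the middle (skip it if n is odd)
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

data = [2, 4, 4, 5, 7, 8, 9, 11, 30]  # hypothetical data with one large value
q1, med, q3 = quartiles(data)
iqr = q3 - q1
print(q1, med, q3, iqr)               # Q1 = 4, median = 7, Q3 = 10, IQR = 6
print(max(data) - min(data))          # the range is 28, inflated by the value 30
```

The large value 30 stretches the range to 28 but leaves the IQR at 6, which illustrates why the IQR is considered robust.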

57

Boxplots

A boxplot is a visual summary of a dataset that displays its central tendency, spread, and outliers. It is based on five key statistics: minimum, Q1, median, Q3, and maximum. The length of the box and whiskers indicates the amount of variation, and dots may mark outliers.

58

Variance

Measures how far data points lie from the mean, on average, in squared units. It is found by averaging the squared deviations from the mean, which gives a sense of how spread out the data are. It is sensitive to outliers.

59

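Population Variance vs. Sample Variance

Population Variance: \sigma^2=\frac{\sum_{i=1}^{N}\left(x_{i}-\mu\right)^2}{N}, where \mu is the population mean and N is the population size.

Sample Variance: s^2=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^2}{n-1}, where \bar{x} is the sample mean and n is the sample size.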
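
The only computational difference between the two formulas is the divisor. The sketch below works both out by hand on a small made-up dataset and checks them against `statistics.pvariance` and `statistics.variance` from Python's standard library; the function names are illustrative.

```python
import statistics

def population_variance(xs):
    """Sum of squared deviations from the mean, divided by the number of values."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical data; the mean is 5
print(population_variance(data))      # 32 / 8 = 4.0
print(sample_variance(data))          # 32 / 7, about 4.571
print(statistics.pvariance(data), statistics.variance(data))  # library equivalents
```

Dividing by n - 1 makes the sample variance slightly larger, which compensates for measuring deviations from the sample mean rather than the unknown population mean.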

60

Standard Deviation

The standard deviation is the square root of the variance and describes the typical distance of data points from the mean, in the same units as the data. A smaller s means values cluster near the mean; a larger s means they are more spread out. Like the mean, it is sensitive to outliers.

s=\sqrt{s^2}

61

The Empirical Rule

The empirical rule describes how data are distributed in a symmetric, bell-shaped pattern. For normal distributions:

  • About 68% of observations fall within 1 standard deviation of the mean.

  • About 95% fall within 2 standard deviations.

  • About 99.7% fall within 3 standard deviations.
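
These percentages can be checked against the normal curve itself. The sketch below uses `statistics.NormalDist` (available in Python 3.8 and later) with a standard normal distribution; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} SD: {share:.4f}")
```

The printed shares are approximately 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% in the rule.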

62

Robust Statistics vs. Sensitive Statistics

  • Robust statistics are little affected by extreme values or skewness. Examples: median (center) and IQR (spread).

  • Sensitive statistics change significantly in the presence of outliers, so they are better suited to symmetric, normally distributed data. Examples: mean, standard deviation, and range.

63

Controlled Experiments

A controlled experiment imposes a treatment on one group (the treatment group) and withholds it from another (the control group). This setup allows researchers to study whether differences in outcomes can be attributed to the treatment itself.

64

Explanatory and Response Variables

The explanatory variable is the variable manipulated by researchers. The response variable is the outcome that is measured or observed.

65

The Placebo Effect

A placebo is a fake treatment or inert substance that has no therapeutic value. It can be given to the control group in an experiment to prevent subjects from knowing whether they are receiving an active treatment. The placebo effect is the improvement people often show simply because they believe they are receiving treatment. Researchers use a placebo to separate these psychological effects from true treatment effects.

66

Randomized Controlled Experiments

Experiments in which participants are assigned to groups using an objective chance method, which limits bias and strengthens the credibility of causal conclusions.
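
One simple chance method is to shuffle the participant list and split it in half. The sketch below assumes 20 hypothetical participants and two equal-sized groups; the function name `randomize` is illustrative.

```python
import random

def randomize(participants, seed=None):
    """Shuffle participants and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(participants)     # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

people = [f"P{i}" for i in range(1, 21)]   # hypothetical participant labels
treatment, control = randomize(people, seed=7)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in each group, neither the researchers nor the participants can steer the assignment.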

67

Confounding factors

Outside variables can influence both the condition being tested and the measured outcome, making it hard to tell what truly caused the change.

68

Comparative Experiments

Two or more groups experience different conditions under otherwise similar circumstances. The contrast between their outcomes helps isolate the effect of the condition being tested.

69

Principles of a Well-Designed Experiment

Reliable studies depend on keeping conditions consistent (control), assigning participants randomly (randomization), including enough participants to confirm the results (replication), and limiting knowledge of group assignments to avoid bias (blinding).

70

Blind Experiments

Participants do not know which group they are in, preventing expectations from influencing their behavior or responses.

71

Double-Blind Experiments

Neither participants nor researchers know who receives the treatment, reducing both subject and researcher bias.

72

Observational Studies

Researchers collect information without altering participants’ behavior or assigning treatments. This reveals patterns and relationships but cannot establish direct causation.

73

Retrospective Studies

Data are gathered from past records or memories to compare previous exposures or conditions among groups.

74

Prospective Studies

Participants are tracked forward in time to see how future outcomes differ among groups with different initial characteristics. No treatment is applied; researchers simply observe what occurs naturally.

75