NumPy array
An ordered collection of values of the same type that supports fast elementwise arithmetic as well as aggregates like sum and mean; in Data 8 you store numeric data in arrays so you can compute on whole columns at once.
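A minimal sketch of both kinds of computation, with made-up values:

```python
import numpy as np

ages = np.array([18, 21, 35, 52])   # hypothetical data

ages + 1       # elementwise: array([19, 22, 36, 53])
ages * 2       # elementwise: array([36, 42, 70, 104])
ages.sum()     # aggregate: 126
ages.mean()    # aggregate: 31.5
```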
Column of a table
An array containing all the values of one variable across rows; in code you get it with t.column("Label") and then use array operations on it.
Row of a table
A single record across all columns in a table; questions often have you think about what a row represents (one person, one movie, one trial, etc.).
Table.read_table
Function that reads a CSV file from disk and returns a table; you use it to bring raw data into Python for analysis.
Table.with_column
Method that returns a new table with an extra column you compute from existing data (for example BMI, predicted values, indicators); core for feature engineering.
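A sketch combining read_table and with_column, assuming the datascience library and a hypothetical people.csv with "Weight (kg)" and "Height (m)" columns:

```python
from datascience import *

people = Table.read_table("people.csv")   # hypothetical file

# with_column returns a NEW table; the original is unchanged
people = people.with_column(
    "BMI", people.column("Weight (kg)") / people.column("Height (m)") ** 2
)
```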
select (tables)
Method that keeps only the specified columns of a table by label or index and returns a narrower table; you use it to focus on relevant variables.
drop (tables)
Method that removes the specified columns from a table and returns the rest; useful for stripping away distracting columns before plotting or modeling.
where (exact match)
Method that filters a table to rows where a column equals a given value exactly, like t.where("Team", "UVA"); used to isolate a single group.
where (condition)
Method that filters a table to rows where a numeric condition holds, like t.where("Age", are.above(30)); used to keep only values in a range.
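A sketch of both forms of where, on a small made-up table:

```python
from datascience import *

t = Table().with_columns(
    "Team", ["UVA", "Duke", "UVA"],
    "Age",  [22, 35, 31],
)

t.where("Team", "UVA")          # exact match: keeps the two UVA rows
t.where("Age", are.above(30))   # predicate: keeps the rows with ages 35 and 31
```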
group (tables)
Method that collapses a table into one row per category, usually counting rows or aggregating a column (mean, sum, etc); used for “by-group” summaries.
pivot (tables)
Method that produces a two-way table with one variable as rows, one as columns, and an aggregate value (count or average) in each cell; lets you see how two categorical variables interact.
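A sketch of group and pivot on the same made-up table:

```python
from datascience import *
import numpy as np

movies = Table().with_columns(
    "Studio", ["Disney", "Disney", "Warner", "Warner"],
    "Decade", [1990, 2000, 1990, 1990],
    "Gross",  [100, 250, 80, 120],
)

movies.group("Studio")             # one row per studio, with a count column
movies.group("Studio", np.mean)    # per-studio means of the numeric columns
movies.pivot("Studio", "Decade")   # two-way table of counts
movies.pivot("Studio", "Decade", values="Gross", collect=np.sum)
```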
join (tables)
Method that merges two tables by matching values in key columns; used when information you need is split across multiple tables.
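A join sketch with two made-up tables that share a "Movie" key column:

```python
from datascience import *

ratings = Table().with_columns(
    "Movie",  ["Up", "Cars"],
    "Rating", [8.3, 7.2],
)
budgets = Table().with_columns(
    "Movie",  ["Cars", "Up"],
    "Budget", [120, 175],
)

# one row per "Movie" value that appears in both tables
ratings.join("Movie", budgets)
```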
sort (tables)
Method that orders rows by the values in a chosen column, optionally with descending=True; useful for finding largest/smallest values and making tables readable.
take (tables)
Method that keeps rows at specific indices (starting at 0); you use it when questions refer to “rows 0–9,” or when you manually split into training and test sets.
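A sketch of sort and take together, on made-up data:

```python
from datascience import *
import numpy as np

scores = Table().with_columns(
    "Name",  ["A", "B", "C", "D"],
    "Score", [88, 95, 71, 95],
)

scores.sort("Score", descending=True)   # largest scores first
scores.take(np.arange(2))               # rows 0 and 1 of the current ordering
scores.take([0, 2])                     # rows 0 and 2
```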
num_rows and num_columns
Attributes that give the number of rows and columns in a table; used to describe data size and to check that reshaping or splitting worked.
np.random.choice
Function for random sampling from an array, with or without replacement and optional probabilities; main tool for simulations and empirical distributions under a chance model.
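A simulation sketch with np.random.choice; the coin probabilities are made up:

```python
import numpy as np

outcomes = np.array(["Heads", "Tails"])

flips = np.random.choice(outcomes, 10)                 # 10 fair flips, with replacement
biased = np.random.choice(outcomes, 10, p=[0.7, 0.3])  # P(Heads) = 0.7

# proportion of heads among the simulated fair flips
np.count_nonzero(flips == "Heads") / 10
```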
Categorical variable
A variable whose values are labels from a finite set of categories (studio, class year, outcome type); summarized with counts, bar charts, and contingency tables.
Numerical variable
A variable measured with numbers where order and differences matter (height, income, score); summarized with histograms, means, medians, and SD.
Distribution of a variable
Description of how often each value or range of values occurs; in practice you look at a histogram for numeric variables and a bar chart for categorical ones.
Bar chart
Plot for a categorical variable where bar heights represent counts or percentages for each category; used to compare category frequencies.
Histogram
Plot for a numerical variable that groups values into bins and uses bar area to represent the percent of data in each bin; used to read shape, center, and spread.
Histogram bin
The numeric interval covered by a single histogram bar (for example 10–20); changing bin width can change the apparent shape of the distribution.
Histogram density (height)
Bar height defined as percent in bin divided by bin width; makes bar area proportional to percent so wider bins do not automatically look “more.”
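A sketch of the density calculation, plus a histogram call with explicit unequal bins (the data are random stand-ins):

```python
from datascience import *
import numpy as np

t = Table().with_column("Age", np.random.randint(0, 100, 500))

# with unequal bins, bar AREA (height x width) shows the percent in each bin
t.hist("Age", bins=[0, 20, 30, 40, 70, 100])

# density height of a bar covering [10, 20) that holds 30% of the data:
30 / (20 - 10)   # 3.0 percent per unit
```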
Area principle
Rule that in any data graphic, areas of shapes should be proportional to the quantities they represent; histograms drawn by bar height over unequal bins, and other misleading graphics, break this rule.
Scatterplot
Plot of paired numerical data (x,y) as points; used to assess direction, form, and strength of the relationship between two quantitative variables.
Line plot
Plot that connects points in time order; used when the x-axis is time and you care about trends over time.
Mean
The average of a numerical variable (sum divided by number of values); the balance point of the histogram and sensitive to extreme values.
Median
The value with 50% of the data at or below it and 50% at or above it; a typical value that is resistant to outliers and skew.
Effect of skew on mean vs median
Right-skew pulls the mean to the right so mean > median; left-skew pulls it left so mean < median; for skewed data the median is usually a better “typical” value.
Standard deviation
A typical distance of data values from their mean; computed as square root of the mean of squared deviations; larger SD means more spread.
Percentile
The smallest value that is at least as large as a given percentage of the sorted data; for example the 25th percentile has 25% of values at or below it.
Standard units (z-score)
A value expressed as number of SDs above or below the mean: (value − mean)/SD; used to compare values on different scales and in normal approximations.
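A sketch computing these summaries on made-up data with one outlier; standard_units is a helper in the style of the Data 8 textbook:

```python
from datascience import percentile
import numpy as np

values = np.array([2.0, 3.0, 3.0, 4.0, 18.0])   # made-up data, one outlier

np.mean(values)          # 6.0 -- pulled up by the outlier
np.median(values)        # 3.0 -- resistant to the outlier
np.std(values)           # SD: sqrt of the mean squared deviation
percentile(25, values)   # 3.0 -- smallest value >= 25% of the data

def standard_units(x):
    """Convert an array to z-scores: (value - mean) / SD."""
    return (x - np.mean(x)) / np.std(x)

standard_units(values)
```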
Normal distribution
A bell-shaped, symmetric distribution fully described by mean and SD; many sample averages are approximately normal even when individual data are not.
Empirical rule
For a roughly normal distribution, about 68% of values are within 1 SD of the mean, about 95% within 2 SDs, and almost all within 3 SDs.
Law of large numbers
As the number of independent repetitions grows, empirical proportions and means get closer to the true probability or population mean.
Random variable
A numerical outcome of a random process (like number of heads in 10 flips or number of toy boxes in a sample); probability questions almost always define one.
Probability of an event
The long-run proportion of times the event would occur in many independent repetitions of the same random process.
Equally likely outcomes rule
If all outcomes are equally likely, P(A) = (number of outcomes in A)/(total number of outcomes); basic rule for simple counting problems.
Multiplication rule
Probability that A and B both occur equals P(A) × P(B | A); if A and B are independent this simplifies to P(A) × P(B).
Addition rule for disjoint events
If an event can happen in exactly one of several non-overlapping ways, its probability is the sum of the probabilities of each way.
Complement rule
P(not A) = 1 − P(A); especially useful for “at least one success” questions as 1 − P(zero successes).
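A worked example of the complement rule for an “at least one” question (fair-die probabilities assumed):

```python
# chance of at least one six in four rolls of a fair die
p_no_six_per_roll = 5 / 6
p_at_least_one_six = 1 - p_no_six_per_roll ** 4   # about 0.5177
```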
Conditional probability
P(A | B) is the chance A happens among only those outcomes where B happened; represents updated probability after learning that B occurred.
Independence (probability)
Events A and B are independent if knowing one occurred does not change the chance of the other, so P(A and B) = P(A) × P(B) and P(A | B) = P(A).
Tree diagram for probability
A diagram of stages and branches labeled with conditional probabilities; used to compute joint and conditional probabilities and to set up Bayes rule.
Bayes rule
Formula that converts P(evidence | hypothesis) and prior probabilities into P(hypothesis | evidence); in Data 8 it is how you update beliefs after seeing a test result.
Prior probability
Probability you assign to a hypothesis before seeing new data (for example the chance a randomly chosen person has a disease before testing).
Likelihood
Probability of the observed data assuming a particular hypothesis is true; in diagnostic problems it is P(test result | disease status).
Posterior probability
Updated probability of a hypothesis after seeing data, computed via Bayes rule; for example P(disease | positive test).
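A worked numeric sketch tying together prior, likelihood, and posterior; the prevalence, sensitivity, and specificity values are made up for illustration:

```python
prior = 0.01          # P(disease): assumed prevalence
sensitivity = 0.95    # P(positive | disease)
specificity = 0.90    # P(negative | no disease)

# total probability of a positive test (the two branches of a tree diagram)
p_positive = prior * sensitivity + (1 - prior) * (1 - specificity)

# Bayes rule: P(disease | positive)
posterior = prior * sensitivity / p_positive   # about 0.088
```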
Sensitivity (true positive rate)
P(test positive | condition present); high sensitivity means the test rarely misses people who truly have the condition.
Specificity (true negative rate)
P(test negative | condition absent); high specificity means the test rarely flags people who do not have the condition.
False positive rate
P(test positive | condition absent); equals 1 − specificity and describes how often the test cries wolf.
False negative rate
P(test negative | condition present); equals 1 − sensitivity and describes how often the test misses real cases.
Bayesian classification view
View of classification as choosing the class with the largest posterior probability P(class | features); in practice you approximate these posteriors from data or a model.
Population
The full collection of individuals or items you care about, such as all UVA students or all cereal boxes produced this year.
Sample
The subset of the population that you actually observe and analyze; all statistics and plots come from the sample, not the full population.
Probability sample
Any sampling design where every unit has a known selection chance; allows valid quantification of sampling variability and generalization to the population.
Convenience sample
A non-random sample taken from whoever is easiest to reach; often biased because selection probabilities are unknown or unequal.
Sampling with replacement
Sampling where selected individuals are returned to the pool and can be chosen again; common in simulations and bootstrap resampling.
Sampling without replacement
Sampling where selected individuals are removed and cannot be selected again; typical in real surveys.
Probability distribution
The theoretical list or function that gives all possible values of a random variable and the probability of each one.
Empirical distribution of data
The observed distribution based on actual data values and the fraction of times each value occurs; summarized by histograms or bar charts.
Parameter
A fixed, usually unknown number describing a population (like the true mean or proportion); the target of inference.
Statistic
A number computed from a sample (sample mean, sample proportion, sample slope); it varies from sample to sample and is used to estimate a parameter.
Sampling distribution of a statistic
The probability distribution of a statistic over all possible random samples; describes how the statistic would vary if you repeated the study many times.
Empirical distribution of a statistic
An approximate sampling distribution built from many simulated values of the statistic (for example via bootstrap or repeated random sampling).
Treatment group
The group in an experiment that receives the treatment or intervention of interest.
Control group
The group in an experiment that does not receive the treatment and serves as a baseline for comparison.
Observational study
A study where the researcher only observes existing choices or exposures without assigning treatments; more vulnerable to confounding than randomized experiments.
Confounding factor
A variable related to both the treatment and the outcome that can create a misleading association that is not actually causal.
Randomized controlled experiment
An experiment where units are randomly assigned to treatment or control; randomization balances confounders so differences in outcomes can be attributed to the treatment.
Chance model
A probability model that describes how data would behave if only random chance were operating (for example “labels are randomly shuffled”); used as the null model in tests.
Null hypothesis
The claim that there is no effect, no difference, or that the chance model is correct (for example equal distributions for A and B); assumed true for the sake of the test.
Alternative hypothesis
The competing claim that there is a real effect or difference (for example one group has a larger mean); what you are looking for evidence in favor of.
Test statistic
A single number computed from the data such that more extreme values favor the alternative over the null; examples include differences in means or TVD.
Simulating under the null
Process of using the null model to repeatedly generate data and compute the test statistic, building an empirical distribution of what is typical if the null is true.
P-value
The probability, under the null hypothesis, of getting a test statistic at least as extreme as the one observed in the direction of the alternative.
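A minimal permutation-test sketch covering the null model, simulation, and P-value; the table and group values are made up:

```python
from datascience import *
import numpy as np

t = Table().with_columns(
    "Group",   ["A", "A", "A", "B", "B", "B"],
    "Outcome", [3.1, 2.8, 3.4, 4.0, 4.3, 3.9],
)

def diff_of_means(table):
    means = table.group("Group", np.mean).column("Outcome mean")
    return means.item(1) - means.item(0)   # B mean minus A mean

observed = diff_of_means(t)

# simulate under the null (no group effect) by shuffling the labels
simulated = []
for _ in range(1000):
    shuffled = t.sample(with_replacement=False).column("Group")
    simulated.append(diff_of_means(t.with_column("Group", shuffled)))

# P-value: fraction of simulations at least as extreme as the observed statistic
p_value = np.count_nonzero(np.array(simulated) >= observed) / 1000
```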
Significance level
A cutoff probability (often 0.05 or 0.01); if the P-value is below this level, the result is called statistically significant and you reject the null.
Type I error rate
The long-run probability of rejecting a true null hypothesis; if you reject whenever P-value < alpha, this error rate equals alpha.
Statistically significant result
A result whose P-value is below the chosen significance level; indicates data that would be rare if the null were true, but does not automatically mean the effect is large or important.
Total variation distance
For categorical data, half the sum of absolute differences between category proportions in two distributions; used as a test statistic to measure how far observed counts are from the model.
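A sketch of TVD as a function of two categorical distributions (the proportions are made up):

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two
    arrays of category proportions."""
    return np.sum(np.abs(dist1 - dist2)) / 2

observed = np.array([0.26, 0.24, 0.25, 0.25])
model    = np.array([0.25, 0.25, 0.25, 0.25])
total_variation_distance(observed, model)   # 0.01
```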
A/B test
A randomized comparison between two versions (A and B), often using a difference in group means or proportions as the test statistic to see if one version outperforms the other.
Bootstrap sample
A sample of the same size as the original, drawn with replacement from the original sample; treats the sample as a stand-in for the population.
Bootstrap distribution
The empirical distribution of many bootstrap statistics (means, medians, slopes, etc.) computed from many bootstrap samples.
Bootstrap principle
If the original sample is large and representative, resampling from it mimics sampling from the population, so the bootstrap distribution approximates the sampling distribution of the statistic.
Percentile confidence interval
A confidence interval for a parameter built by taking appropriate percentiles of the bootstrap distribution (for example the middle 95% for a 95% interval).
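A bootstrap percentile-CI sketch, using randomly generated values as a stand-in for a real sample:

```python
from datascience import *
import numpy as np

sample = Table().with_column("Value", np.random.normal(50, 10, 200))

def bootstrap_mean(t):
    # a same-size resample drawn WITH replacement from the original sample
    return np.mean(t.sample(with_replacement=True).column("Value"))

boot_means = np.array([bootstrap_mean(sample) for _ in range(1000)])

# 95% percentile CI: middle 95% of the bootstrap distribution
left = percentile(2.5, boot_means)
right = percentile(97.5, boot_means)
```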
Confidence level
The long-run percentage of confidence intervals produced by a method that contain the true parameter in repeated, similar studies (for example 95%).
Correct confidence interval interpretation
A 95% CI means the method would capture the true parameter about 95% of the time in repeated studies; the parameter is fixed, the interval is random.
When not to use bootstrap
Avoid bootstrapping when the original sample is tiny, badly biased, or when you want intervals for statistics driven by rare extremes (like min or max).
Central Limit Theorem
For large random samples, the distribution of the sample mean is approximately normal with mean equal to the population mean and SD equal to the population SD divided by sqrt(n), regardless of the shape of the population.
Distribution of the sample mean
Under CLT conditions, the sample mean is approximately N(mu, sigma/sqrt(n)), centered at the true mean with smaller spread as n increases.
Accuracy vs sample size
As n grows, the SD of the sample mean shrinks like 1/sqrt(n), so estimates become less variable, but each additional observation helps a bit less than the previous one.
Proportion as mean of 0–1s
If you code success as 1 and failure as 0, then the population proportion of successes is the population mean of the 0–1 data, and the sample proportion is the sample mean.
Approximate 95 percent CI for a mean
When the sample mean is roughly normal, an informal 95% CI is sample mean ± 2 × SD(sample mean); this uses the CLT to quantify uncertainty.
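A sketch of the informal CLT-based interval, plugging the sample SD in for the population SD (the data are random stand-ins):

```python
import numpy as np

sample = np.random.normal(100, 15, 400)   # stand-in for real sample data
n = len(sample)

# SD of the sample mean is roughly (sample SD) / sqrt(n)
sd_of_mean = np.std(sample) / np.sqrt(n)

# informal 95% CI: sample mean plus or minus 2 SDs of the sample mean
center = np.mean(sample)
interval = (center - 2 * sd_of_mean, center + 2 * sd_of_mean)
```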
Worst-case SD for a 0–1 population
The largest possible SD of a 0–1 variable is 0.5, occurring when half the population are 1s and half are 0s; used as a conservative plug-in when planning sample size.
Sample size for desired CI width
You choose n so that desired width ≈ 4 × (population SD)/sqrt(n), then solve for n; for proportions you often plug in SD ≈ 0.5 to get a safe upper bound on the needed n.
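A worked example of the sample-size calculation for a proportion, using the worst-case SD of 0.5 (the target width is assumed):

```python
import numpy as np

desired_width = 0.06   # e.g. a 95% CI no wider than plus or minus 3 points
sd_upper_bound = 0.5   # worst-case SD of a 0-1 population

# width = 4 * SD / sqrt(n)  =>  n = (4 * SD / width)^2
n = int(np.ceil((4 * sd_upper_bound / desired_width) ** 2))   # 1112
```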
Correlation coefficient r
A unitless measure of linear association between two numerical variables, equal to the average product of x and y in standard units; r is between −1 and 1.
Interpretation of r
The sign of r gives the direction (positive or negative) and the magnitude (near 0 or near 1 in absolute value) describes the strength of the linear pattern.
Correlation pitfalls
Correlation only captures linear relationships, is sensitive to outliers, and never by itself proves causation or tells you which variable causes which.
Regression line
The straight line that minimizes the mean squared residuals when predicting y from x; gives the best linear predictor of y based on x under squared error.
Regression slope formula
Slope = r × (SD of y)/(SD of x); in standard units the regression line always has slope equal to r.
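A sketch of regression helpers in the style of the Data 8 textbook, built from r and the SDs:

```python
import numpy as np

def standard_units(x):
    return (x - np.mean(x)) / np.std(x)

def correlation(x, y):
    """r: the mean product of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

def slope(x, y):
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)
```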