NumPy array
An ordered collection of values of the same type that supports fast elementwise arithmetic as well as aggregates like sum and mean; in Data 8 you store numeric data in arrays so you can compute on whole columns at once.
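A minimal sketch of both kinds of computation, with made-up values:

```python
import numpy as np

ages = np.array([18, 21, 35, 52])   # hypothetical data

ages + 1       # elementwise: array([19, 22, 36, 53])
ages * 2       # elementwise: array([36, 42, 70, 104])
ages.sum()     # aggregate: 126
ages.mean()    # aggregate: 31.5
```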
Column of a table
An array containing all the values of one variable across rows; in code you get it with t.column("Label") and then use array operations on it.
Row of a table
A single record across all columns in a table; questions often have you think about what a row represents (one person, one movie, one trial, etc.).
Table.read_table
Function that reads a CSV file from disk and returns a table; you use it to bring raw data into Python for analysis.
Table.with_column
Method that returns a new table with an extra column you compute from existing data (for example BMI, predicted values, indicators); core for feature engineering.
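A sketch combining read_table and with_column, assuming the datascience library and a hypothetical people.csv with "Weight (kg)" and "Height (m)" columns:

```python
from datascience import *

people = Table.read_table("people.csv")   # hypothetical file

# with_column returns a NEW table; the original is unchanged
people = people.with_column(
    "BMI", people.column("Weight (kg)") / people.column("Height (m)") ** 2
)
```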
select (tables)
Method that keeps only the specified columns of a table by label or index and returns a narrower table; you use it to focus on relevant variables.
drop (tables)
Method that removes the specified columns from a table and returns the rest; useful for stripping away distracting columns before plotting or modeling.
where (exact match)
Method that filters a table to rows where a column equals a given value exactly, like t.where("Team", "UVA"); used to isolate a single group.
where (condition)
Method that filters a table to rows where a numeric condition holds, like t.where("Age", are.above(30)); used to keep only values in a range.
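A sketch of both forms of where, on a small made-up table:

```python
from datascience import *

t = Table().with_columns(
    "Team", ["UVA", "Duke", "UVA"],
    "Age",  [22, 35, 31],
)

t.where("Team", "UVA")          # exact match: keeps the two UVA rows
t.where("Age", are.above(30))   # predicate: keeps the rows with ages 35 and 31
```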
group (tables)
Method that collapses a table into one row per category, usually counting rows or aggregating a column (mean, sum, etc); used for “by-group” summaries.
pivot (tables)
Method that produces a two-way table with one variable as rows, one as columns, and an aggregate value (count or average) in each cell; lets you see how two categorical variables interact.
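A sketch of group and pivot on the same made-up table:

```python
from datascience import *
import numpy as np

movies = Table().with_columns(
    "Studio", ["Disney", "Disney", "Warner", "Warner"],
    "Decade", [1990, 2000, 1990, 1990],
    "Gross",  [100, 250, 80, 120],
)

movies.group("Studio")             # one row per studio, with a count column
movies.group("Studio", np.mean)    # per-studio means of the numeric columns
movies.pivot("Studio", "Decade")   # two-way table of counts
movies.pivot("Studio", "Decade", values="Gross", collect=np.sum)
```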
join (tables)
Method that merges two tables by matching values in key columns; used when information you need is split across multiple tables.
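A join sketch with two made-up tables that share a "Movie" key column:

```python
from datascience import *

ratings = Table().with_columns(
    "Movie",  ["Up", "Cars"],
    "Rating", [8.3, 7.2],
)
budgets = Table().with_columns(
    "Movie",  ["Cars", "Up"],
    "Budget", [120, 175],
)

# one row per "Movie" value that appears in both tables
ratings.join("Movie", budgets)
```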
sort (tables)
Method that orders rows by the values in a chosen column, optionally with descending=True; useful for finding largest/smallest values and making tables readable.
take (tables)
Method that keeps rows at specific indices (starting at 0); you use it when questions refer to “rows 0–9,” or when you manually split into training and test sets.
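A sketch of sort and take together, on made-up data:

```python
from datascience import *
import numpy as np

scores = Table().with_columns(
    "Name",  ["A", "B", "C", "D"],
    "Score", [88, 95, 71, 95],
)

scores.sort("Score", descending=True)   # largest scores first
scores.take(np.arange(2))               # rows 0 and 1 of the current ordering
scores.take([0, 2])                     # rows 0 and 2
```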
num_rows and num_columns
Attributes that give the number of rows and columns in a table; used to describe data size and to check that reshaping or splitting worked.
np.random.choice
Function for random sampling from an array, with or without replacement and optional probabilities; main tool for simulations and empirical distributions under a chance model.
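A simulation sketch with np.random.choice; the coin probabilities are made up:

```python
import numpy as np

outcomes = np.array(["Heads", "Tails"])

flips = np.random.choice(outcomes, 10)                 # 10 fair flips, with replacement
biased = np.random.choice(outcomes, 10, p=[0.7, 0.3])  # P(Heads) = 0.7

# proportion of heads among the simulated fair flips
np.count_nonzero(flips == "Heads") / 10
```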
Categorical variable
A variable whose values are labels from a finite set of categories (studio, class year, outcome type); summarized with counts, bar charts, and contingency tables.
Numerical variable
A variable measured with numbers where order and differences matter (height, income, score); summarized with histograms, means, medians, and SD.
Distribution of a variable
Description of how often each value or range of values occurs; in practice you look at a histogram for numeric variables and a bar chart for categorical ones.
Bar chart
Plot for a categorical variable where bar heights represent counts or percentages for each category; used to compare category frequencies.
Histogram
Plot for a numerical variable that groups values into bins and uses bar area to represent the percent of data in each bin; used to read shape, center, and spread.
Histogram bin
The numeric interval covered by a single histogram bar (for example 10–20); changing bin width can change the apparent shape of the distribution.
Histogram density (height)
Bar height defined as percent in bin divided by bin width; makes bar area proportional to percent so wider bins do not automatically look “more.”
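A sketch of the density calculation, plus a histogram call with explicit unequal bins (the data are random stand-ins):

```python
from datascience import *
import numpy as np

t = Table().with_column("Age", np.random.randint(0, 100, 500))

# with unequal bins, bar AREA (height x width) shows the percent in each bin
t.hist("Age", bins=[0, 20, 30, 40, 70, 100])

# density height of a bar covering [10, 20) that holds 30% of the data:
30 / (20 - 10)   # 3.0 percent per unit
```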
Area principle
Rule that in any data graphic, areas of shapes should be proportional to the quantities they represent; histograms drawn by bar height over unequal bins, and other misleading graphics, break this rule.
Scatterplot
Plot of paired numerical data (x,y) as points; used to assess direction, form, and strength of the relationship between two quantitative variables.
Line plot
Plot that connects points in time order; used when the x-axis is time and you care about trends over time.
Mean
The average of a numerical variable (sum divided by number of values); the balance point of the histogram and sensitive to extreme values.
Median
The value with 50% of the data at or below it and 50% at or above it; a typical value that is resistant to outliers and skew.
Effect of skew on mean vs median
Right-skew pulls the mean to the right so mean > median; left-skew pulls it left so mean < median; for skewed data the median is usually a better “typical” value.
Standard deviation
A typical distance of data values from their mean; computed as square root of the mean of squared deviations; larger SD means more spread.
Percentile
The smallest value that is at least as large as a given percentage of the sorted data; for example the 25th percentile has 25% of values at or below it.
Standard units (z-score)
A value expressed as number of SDs above or below the mean: (value − mean)/SD; used to compare values on different scales and in normal approximations.
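A sketch computing these summaries on made-up data with one outlier; standard_units is a helper in the style of the Data 8 textbook:

```python
from datascience import percentile
import numpy as np

values = np.array([2.0, 3.0, 3.0, 4.0, 18.0])   # made-up data, one outlier

np.mean(values)          # 6.0 -- pulled up by the outlier
np.median(values)        # 3.0 -- resistant to the outlier
np.std(values)           # SD: sqrt of the mean squared deviation
percentile(25, values)   # 3.0 -- smallest value >= 25% of the data

def standard_units(x):
    """Convert an array to z-scores: (value - mean) / SD."""
    return (x - np.mean(x)) / np.std(x)

standard_units(values)
```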
Normal distribution
A bell-shaped, symmetric distribution fully described by mean and SD; many sample averages are approximately normal even when individual data are not.
Empirical rule
For a roughly normal distribution, about 68% of values are within 1 SD of the mean, about 95% within 2 SDs, and almost all within 3 SDs.
Law of large numbers
As the number of independent repetitions grows, empirical proportions and means get closer to the true probability or population mean.
Random variable
A numerical outcome of a random process (like number of heads in 10 flips or number of toy boxes in a sample); probability questions almost always define one.
Probability of an event
The long-run proportion of times the event would occur in many independent repetitions of the same random process.
Equally likely outcomes rule
If all outcomes are equally likely, P(A) = (number of outcomes in A)/(total number of outcomes); basic rule for simple counting problems.
Multiplication rule
Probability that A and B both occur equals P(A) × P(B | A); if A and B are independent this simplifies to P(A) × P(B).
Addition rule for disjoint events
If an event can happen in exactly one of several non-overlapping ways, its probability is the sum of the probabilities of each way.
Complement rule
P(not A) = 1 − P(A); especially useful for “at least one success” questions as 1 − P(zero successes).
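A worked example of the complement rule for an “at least one” question (fair-die probabilities assumed):

```python
# chance of at least one six in four rolls of a fair die
p_no_six_per_roll = 5 / 6
p_at_least_one_six = 1 - p_no_six_per_roll ** 4   # about 0.5177
```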
Conditional probability
P(A | B) is the chance A happens among only those outcomes where B happened; represents updated probability after learning that B occurred.
Independence (probability)
Events A and B are independent if knowing one occurred does not change the chance of the other, so P(A and B) = P(A) × P(B) and P(A | B) = P(A).
Tree diagram for probability
A diagram of stages and branches labeled with conditional probabilities; used to compute joint and conditional probabilities and to set up Bayes rule.
Bayes rule
Formula that converts P(evidence | hypothesis) and prior probabilities into P(hypothesis | evidence); in Data 8 it is how you update beliefs after seeing a test result.
Prior probability
Probability you assign to a hypothesis before seeing new data (for example the chance a randomly chosen person has a disease before testing).
Likelihood
Probability of the observed data assuming a particular hypothesis is true; in diagnostic problems it is P(test result | disease status).
Posterior probability
Updated probability of a hypothesis after seeing data, computed via Bayes rule; for example P(disease | positive test).
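A worked numeric sketch tying together prior, likelihood, and posterior; the prevalence, sensitivity, and specificity values are made up for illustration:

```python
prior = 0.01          # P(disease): assumed prevalence
sensitivity = 0.95    # P(positive | disease)
specificity = 0.90    # P(negative | no disease)

# total probability of a positive test (the two branches of a tree diagram)
p_positive = prior * sensitivity + (1 - prior) * (1 - specificity)

# Bayes rule: P(disease | positive)
posterior = prior * sensitivity / p_positive   # about 0.088
```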
Sensitivity (true positive rate)
P(test positive | condition present); high sensitivity means the test rarely misses people who truly have the condition.
Specificity (true negative rate)
P(test negative | condition absent); high specificity means the test rarely flags people who do not have the condition.
False positive rate
P(test positive | condition absent); equals 1 − specificity and describes how often the test cries wolf.
False negative rate
P(test negative | condition present); equals 1 − sensitivity and describes how often the test misses real cases.
Bayesian classification view
View of classification as choosing the class with the largest posterior probability P(class | features); in practice you approximate these posteriors from data or a model.
Population
The full collection of individuals or items you care about, such as all UVA students or all cereal boxes produced this year.
Sample
The subset of the population that you actually observe and analyze; all statistics and plots come from the sample, not the full population.
Probability sample
Any sampling design where every unit has a known selection chance; allows valid quantification of sampling variability and generalization to the population.
Convenience sample
A non-random sample taken from whoever is easiest to reach; often biased because selection probabilities are unknown or unequal.
Sampling with replacement
Sampling where selected individuals are returned to the pool and can be chosen again; common in simulations and bootstrap resampling.
Sampling without replacement
Sampling where selected individuals are removed and cannot be selected again; typical in real surveys.
Probability distribution
The theoretical list or function that gives all possible values of a random variable and the probability of each one.
Empirical distribution of data
The observed distribution based on actual data values and the fraction of times each value occurs; summarized by histograms or bar charts.
Parameter
A fixed, usually unknown number describing a population (like the true mean or proportion); the target of inference.
Statistic
A number computed from a sample (sample mean, sample proportion, sample slope); it varies from sample to sample and is used to estimate a parameter.
Sampling distribution of a statistic
The probability distribution of a statistic over all possible random samples; describes how the statistic would vary if you repeated the study many times.
Empirical distribution of a statistic
An approximate sampling distribution built from many simulated values of the statistic (for example via bootstrap or repeated random sampling).
Treatment group
The group in an experiment that receives the treatment or intervention of interest.
Control group
The group in an experiment that does not receive the treatment and serves as a baseline for comparison.
Observational study
A study where the researcher only observes existing choices or exposures without assigning treatments; more vulnerable to confounding than randomized experiments.
Confounding factor
A variable related to both the treatment and the outcome that can create a misleading association that is not actually causal.
Randomized controlled experiment
An experiment where units are randomly assigned to treatment or control; randomization balances confounders so differences in outcomes can be attributed to the treatment.
Chance model
A probability model that describes how data would behave if only random chance were operating (for example “labels are randomly shuffled”); used as the null model in tests.
Null hypothesis
The claim that there is no effect, no difference, or that the chance model is correct (for example equal distributions for A and B); assumed true for the sake of the test.
Alternative hypothesis
The competing claim that there is a real effect or difference (for example one group has a larger mean); what you are looking for evidence in favor of.
Test statistic
A single number computed from the data such that more extreme values favor the alternative over the null; examples include differences in means or TVD.
Simulating under the null
Process of using the null model to repeatedly generate data and compute the test statistic, building an empirical distribution of what is typical if the null is true.
P-value
The probability, under the null hypothesis, of getting a test statistic at least as extreme as the one observed in the direction of the alternative.
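A minimal permutation-test sketch covering the null model, simulation, and P-value; the table and group values are made up:

```python
from datascience import *
import numpy as np

t = Table().with_columns(
    "Group",   ["A", "A", "A", "B", "B", "B"],
    "Outcome", [3.1, 2.8, 3.4, 4.0, 4.3, 3.9],
)

def diff_of_means(table):
    means = table.group("Group", np.mean).column("Outcome mean")
    return means.item(1) - means.item(0)   # B mean minus A mean

observed = diff_of_means(t)

# simulate under the null (no group effect) by shuffling the labels
simulated = []
for _ in range(1000):
    shuffled = t.sample(with_replacement=False).column("Group")
    simulated.append(diff_of_means(t.with_column("Group", shuffled)))

# P-value: fraction of simulations at least as extreme as the observed statistic
p_value = np.count_nonzero(np.array(simulated) >= observed) / 1000
```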
Significance level
A cutoff probability (often 0.05 or 0.01); if the P-value is below this level, the result is called statistically significant and you reject the null.
Type I error rate
The long-run probability of rejecting a true null hypothesis; if you reject whenever P-value < alpha, this error rate equals alpha.
Statistically significant result
A result whose P-value is below the chosen significance level; indicates data that would be rare if the null were true, but does not automatically mean the effect is large or important.
Total variation distance
For categorical data, half the sum of absolute differences between category proportions in two distributions; used as a test statistic to measure how far observed counts are from the model.
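A sketch of TVD as a function of two categorical distributions (the proportions are made up):

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two
    arrays of category proportions."""
    return np.sum(np.abs(dist1 - dist2)) / 2

observed = np.array([0.26, 0.24, 0.25, 0.25])
model    = np.array([0.25, 0.25, 0.25, 0.25])
total_variation_distance(observed, model)   # 0.01
```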
A/B test
A randomized comparison between two versions (A and B), often using a difference in group means or proportions as the test statistic to see if one version outperforms the other.
Bootstrap sample
A sample of the same size as the original, drawn with replacement from the original sample; treats the sample as a stand-in for the population.
Bootstrap distribution
The empirical distribution of many bootstrap statistics (means, medians, slopes, etc.) computed from many bootstrap samples.
Bootstrap principle
If the original sample is large and representative, resampling from it mimics sampling from the population, so the bootstrap distribution approximates the sampling distribution of the statistic.
Percentile confidence interval
A confidence interval for a parameter built by taking appropriate percentiles of the bootstrap distribution (for example the middle 95% for a 95% interval).
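A bootstrap percentile-CI sketch, using randomly generated values as a stand-in for a real sample:

```python
from datascience import *
import numpy as np

sample = Table().with_column("Value", np.random.normal(50, 10, 200))

def bootstrap_mean(t):
    # a same-size resample drawn WITH replacement from the original sample
    return np.mean(t.sample(with_replacement=True).column("Value"))

boot_means = np.array([bootstrap_mean(sample) for _ in range(1000)])

# 95% percentile CI: middle 95% of the bootstrap distribution
left = percentile(2.5, boot_means)
right = percentile(97.5, boot_means)
```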
Confidence level
The long-run percentage of confidence intervals produced by a method that contain the true parameter in repeated, similar studies (for example 95%).
Correct confidence interval interpretation
A 95% CI means the method would capture the true parameter about 95% of the time in repeated studies; the parameter is fixed, the interval is random.
When not to use bootstrap
Avoid bootstrapping when the original sample is tiny, badly biased, or when you want intervals for statistics driven by rare extremes (like min or max).
Central Limit Theorem
For large random samples, the distribution of the sample mean is approximately normal with mean equal to the population mean and SD equal to the population SD divided by sqrt(n), regardless of the shape of the population.
Distribution of the sample mean
Under CLT conditions, the sample mean is approximately N(mu, sigma/sqrt(n)), centered at the true mean with smaller spread as n increases.
Accuracy vs sample size
As n grows, the SD of the sample mean shrinks like 1/sqrt(n), so estimates become less variable, but each additional observation helps a bit less than the previous one.
Proportion as mean of 0–1s
If you code success as 1 and failure as 0, then the population proportion of successes is the population mean of the 0–1 data, and the sample proportion is the sample mean.
Approximate 95 percent CI for a mean
When the sample mean is roughly normal, an informal 95% CI is sample mean ± 2 × SD(sample mean); this uses the CLT to quantify uncertainty.
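A sketch of the informal CLT-based interval, plugging the sample SD in for the population SD (the data are random stand-ins):

```python
import numpy as np

sample = np.random.normal(100, 15, 400)   # stand-in for real sample data
n = len(sample)

# SD of the sample mean is roughly (sample SD) / sqrt(n)
sd_of_mean = np.std(sample) / np.sqrt(n)

# informal 95% CI: sample mean plus or minus 2 SDs of the sample mean
center = np.mean(sample)
interval = (center - 2 * sd_of_mean, center + 2 * sd_of_mean)
```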
Worst-case SD for a 0–1 population
The largest possible SD of a 0–1 variable is 0.5, occurring when half the population are 1s and half are 0s; used as a conservative plug-in when planning sample size.
Sample size for desired CI width
You choose n so that desired width ≈ 4 × (population SD)/sqrt(n), then solve for n; for proportions you often plug in SD ≈ 0.5 to get a safe upper bound on the needed n.
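A worked example of the sample-size calculation for a proportion, using the worst-case SD of 0.5 (the target width is assumed):

```python
import numpy as np

desired_width = 0.06   # e.g. a 95% CI no wider than plus or minus 3 points
sd_upper_bound = 0.5   # worst-case SD of a 0-1 population

# width = 4 * SD / sqrt(n)  =>  n = (4 * SD / width)^2
n = int(np.ceil((4 * sd_upper_bound / desired_width) ** 2))   # 1112
```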
Correlation coefficient r
A unitless measure of linear association between two numerical variables, equal to the average product of x and y in standard units; r is between −1 and 1.
Interpretation of r
The sign of r gives the direction (positive or negative) and the magnitude (near 0 or near 1 in absolute value) describes the strength of the linear pattern.
Correlation pitfalls
Correlation only captures linear relationships, is sensitive to outliers, and never by itself proves causation or tells you which variable causes which.
Regression line
The straight line that minimizes the mean squared residuals when predicting y from x; gives the best linear predictor of y based on x under squared error.
Regression slope formula
Slope = r × (SD of y)/(SD of x); in standard units the regression line always has slope equal to r.
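A sketch of regression helpers in the style of the Data 8 textbook, built from r and the SDs:

```python
import numpy as np

def standard_units(x):
    return (x - np.mean(x)) / np.std(x)

def correlation(x, y):
    """r: the mean product of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

def slope(x, y):
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)
```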