Population
Everyone or everything you care about in a study, even if you cannot collect data from all of them.
Sample
The smaller group you actually collect data from, used to learn about the bigger population.
Probability sample
A sample picked using random chance in a known way so each individual has a known chance to be chosen.
Convenience sample
A sample made of people who are easy to reach, which can be biased and may not represent the population well.
With vs without replacement
With replacement, the same item can be picked more than once; without replacement, each item can be picked at most once.
Simple random sample (SRS)
A random sample where every possible group of a given size is equally likely to be chosen.
Probability distribution
A rule or table that lists all possible outcomes of a random process and how likely each one is.
Empirical distribution
The distribution you see in real data: each value and how often it appears in your sample.
Law of Averages
If you repeat a random process many times, the observed proportion of an outcome gets closer to its true probability.
Parameter
A fixed number that describes the whole population, such as the true average or true proportion; it is usually unknown.
Statistic
A number you calculate from your sample, such as the sample mean or sample proportion, used to estimate a parameter.
Sampling distribution of a statistic
The distribution of a statistic’s values over all possible random samples from the population.
Empirical distribution of a statistic
The histogram of many simulated or resampled values of a statistic used to approximate its sampling distribution.
Chance model
A description of how data are generated by random mechanisms, which we can simulate to see what is typical.
Steps to assess a model
Choose a statistic, simulate it many times under the model, plot the simulated values, and compare your actual statistic to that plot.
sample_proportions
sample_proportions(n, distribution) draws a random sample of size n from a categorical distribution and returns the sample proportions.
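A minimal plain-NumPy sketch of what this function computes (the actual datascience library implementation may differ; the function name and example distribution here are illustrative):

```python
import numpy as np

def sample_proportions(n, distribution):
    # Draw n items from a categorical distribution (a list of category
    # probabilities) and return the proportion landing in each category.
    rng = np.random.default_rng(7)
    counts = rng.multinomial(n, distribution)
    return counts / n

# Example: 1000 draws from a fair three-category distribution.
props = sample_proportions(1000, [1/3, 1/3, 1/3])
```

The returned proportions always sum to 1 and have one entry per category.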
Null hypothesis
The “no effect” or “nothing interesting” model that says how the data would look if only chance were operating.
Alternative hypothesis
The competing claim that says something real is happening, such as a difference, an effect, or an association.
Test statistic
A single number summarizing the data so that large or small values give evidence against the null hypothesis.
Empirical null distribution
The distribution of test statistic values you get by simulating data under the null hypothesis many times.
Tail area
The part of the null distribution at or beyond your observed test statistic in the direction that supports the alternative.
p-value
The chance, assuming the null hypothesis is true, of getting a test statistic as extreme as or more extreme than the one observed.
Statistical significance (5% level)
A result is significant at the 5 percent level if its p-value is less than 0.05, so it would rarely happen by pure chance under the null.
Significance level
The cutoff, such as 0.05 or 0.01, chosen in advance for deciding whether a p-value is small enough to reject the null.
Total variation distance (TVD)
A number between 0 and 1 that measures how different two categorical distributions are: 0 means identical, 1 means completely different.
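TVD is half the sum of the absolute differences between two distributions. A small sketch, with made-up distributions for illustration:

```python
import numpy as np

def tvd(dist1, dist2):
    # Total variation distance between two categorical distributions
    # given as arrays of proportions that each sum to 1.
    return np.abs(np.array(dist1) - np.array(dist2)).sum() / 2

d = tvd([0.5, 0.3, 0.2], [0.4, 0.3, 0.3])   # 0.1
```

Identical distributions give 0; distributions with no overlap give 1.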
Permutation test
A test where you shuffle labels, such as “treatment” and “control”, many times to see what differences could arise just by chance.
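The shuffling idea can be sketched in plain NumPy (the data values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

treatment = np.array([5.1, 6.2, 7.0, 6.8])
control   = np.array([4.9, 5.0, 5.5, 4.7])
observed_diff = treatment.mean() - control.mean()

pooled = np.concatenate([treatment, control])
diffs = []
for _ in range(5000):
    shuffled = rng.permutation(pooled)      # shuffle the group labels
    diffs.append(shuffled[:4].mean() - shuffled[4:].mean())

# p-value: fraction of shuffled differences at least as large as observed
p_value = np.mean(np.array(diffs) >= observed_diff)
```

A small p-value means the observed difference rarely arises from label shuffling alone.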
Randomized controlled experiment
A study where units are randomly assigned to treatment or control so differences in outcomes can be interpreted as caused by the treatment.
Percentile
The value below which a certain percent of the data fall; for example, the 50th percentile is the median.
Bootstrap sample
A new sample of the same size drawn with replacement from your original sample, treating the sample as if it were the population.
Bootstrap principle
If your original sample is large and fairly random, resampling it with replacement mimics taking new samples from the population.
Bootstrap distribution
The distribution of a statistic computed from many bootstrap samples, used to estimate its variability.
95% bootstrap confidence interval
The interval between the 2.5th and 97.5th percentiles of the bootstrap statistics: a range of plausible values for the parameter.
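The percentile method in plain NumPy (the sample here is simulated stand-in data):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=200)   # stand-in for real data

boot_means = []
for _ in range(2000):
    # Resample with replacement, same size as the original sample.
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means.append(resample.mean())

# 95% bootstrap CI: middle 95% of the bootstrap statistics.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
```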
When bootstrap is unreliable
The bootstrap can fail when the sample is tiny or not random, or when the parameter depends on extreme values such as the minimum or maximum.
Interpretation of a confidence interval
The method produces intervals that capture the true parameter a certain percent of the time, such as 95 percent; the parameter itself does not move.
Using a CI for testing
If the hypothesized value is outside the confidence interval, you reject the null at that confidence level; if it is inside, you do not reject.
Distribution of the sample average
The pattern of sample means you would see if you took many random samples of the same size from the population.
Central Limit Theorem (CLT)
For large random samples the distribution of the sample mean is roughly bell shaped centered at the population mean.
95% CLT confidence interval for mean
Take the sample mean and go about two standard errors up and down: mean plus or minus 2 times (SD divided by the square root of n).
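As a sketch (the helper name is illustrative):

```python
import numpy as np

def clt_ci(data):
    # Approximate 95% CI for the population mean:
    # sample mean plus or minus 2 standard errors.
    data = np.array(data)
    se = data.std() / np.sqrt(len(data))
    return data.mean() - 2 * se, data.mean() + 2 * se

low, high = clt_ci([12, 15, 14, 10, 13, 16, 11, 14])
```

The interval is centered at the sample mean, and its width shrinks like 1 over the square root of n.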
Proportions as 0/1 averages
If you code “yes” as 1 and “no” as 0, then the average of those 0s and 1s equals the proportion of 1s.
CI width for a population proportion
The confidence interval for a proportion gets narrower when the sample size increases and when the data are less variable.
SD of a 0/1 population
For a proportion p of 1s, the standard deviation is the square root of p times (1 minus p), which is largest when p equals 0.5.
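The last two cards can be checked numerically: the mean of a 0/1 array is the proportion of 1s, and its SD matches the formula (the votes array is invented for illustration):

```python
import numpy as np

votes = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # 1 = "yes", 0 = "no"
p = votes.mean()                       # proportion of 1s, here 0.7
sd = np.sqrt(p * (1 - p))              # SD formula for a 0/1 population
# NumPy's population SD of the raw 0s and 1s agrees with the formula.
assert np.isclose(votes.std(), sd)
```

Note that sqrt(0.5 * 0.5) = 0.5 is the largest this SD can be.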
Sample size from desired CI width
To make a confidence interval half as wide you need about four times as many observations.
Mean (average)
Add up all the values and divide by how many there are; the mean is sensitive to extreme values.
Median
The middle value when your data are sorted; it is less affected by outliers than the mean.
Standard deviation (SD)
A number that measures how spread out the data are around the mean; a larger SD means more spread.
Chebyshev’s inequality
In any distribution, at least 1 minus 1 over k squared of the values lie within k standard deviations of the mean, no matter what the shape looks like.
Standard units (z-scores)
A way to measure how far a value is from the mean in standard deviations: (value minus mean) divided by SD.
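A small sketch of the conversion (the function name is illustrative):

```python
import numpy as np

def standard_units(values):
    # Convert an array to z-scores: (value - mean) / SD.
    values = np.array(values)
    return (values - values.mean()) / values.std()

z = standard_units([2, 4, 6, 8])
```

In standard units any data set has mean 0 and SD 1.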
Categorical variable
A variable whose values are groups or labels, such as “red”, “blue”, “yes”, or “no”, instead of numeric measurements.
Numerical variable
A variable measured with numbers where order and differences make sense, such as height, weight, or income.
Bar chart
A plot with one bar for each category where bar lengths show how many or what percent fall in each category.
Histogram
A plot for numerical data where nearby values are grouped into bins and bar areas show how many observations are in each bin.
Bin and bin width
A bin is an interval of values on the number line and its width is how long that interval is.
Area principle
In good graphs the area of shapes matches the quantities they represent, so bigger areas mean bigger values.
Histogram height and density
The height of a histogram bar equals the percent in the bin divided by the bin width, showing how crowded the data are in that interval.
Bar chart vs histogram
Bar charts show categories with separate bars; histograms show numeric data on a number line, usually with touching bars.
Scatterplot
A plot with one point per individual showing two numerical variables on the x and y axes to reveal patterns or relationships.
Line plot
A plot where points are connected in order, often used to show how a quantity changes over time.
Probability
The long-run fraction of times an event would happen if you repeated the random process many times.
Equally likely outcomes rule
If all outcomes are equally likely an event’s probability is (number of outcomes in the event) divided by (total number of outcomes).
Multiplication rule
The chance two events both happen equals the chance the first happens times the chance the second happens given the first.
Addition rule
If two events cannot happen at the same time the chance that one or the other happens is the sum of their probabilities.
Complement rule
The chance something does not happen is one minus the chance that it does happen.
Table
A Data 8 object with labeled columns and rows where each column is an array representing one variable.
Array
A NumPy object holding an ordered list of values, usually numbers, on which you can do fast elementwise math.
List vs array
Lists are general Python containers that can hold mixed types; arrays are typically numeric, faster, and work better with tables and math operations.
Table.read_table
Table.read_table('file.csv') loads a data file into a table with one row per record and one column per variable.
with_column
table.with_column('New', values) returns a new table with an extra column named New filled with the given array.
select and drop
select keeps only the columns you name; drop removes the columns you name from the table.
where with conditions
table.where(column, condition) keeps only rows whose values in that column meet the condition, such as are.above(10).
sort and take
sort orders rows by a column; take grabs rows by index, such as table.take(0) or table.take(range(10)).
group
table.group('Label') counts how many rows fall in each category of Label and can also compute summaries like averages in each group.
pivot
pivot makes a table whose rows and columns are categories from two variables, often used for contingency or summary tables.
join
table1.join('key', table2, 'key') combines two tables by matching rows that share the same key values.
Table.sample
table.sample(n) randomly selects n rows from a table; passing with_replacement=True lets the same row be picked more than once, as needed for the bootstrap.
Plotting methods
Table methods such as hist, barh, scatter, and plot draw common charts directly from columns of data.
Simulation pattern
Make an empty array; loop many times, computing a simulated value and appending it to the array; then turn the array into a table and plot a histogram.
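The same loop in plain NumPy, simulating 100 coin tosses many times (in the course the final array would go into a Table for plotting, shown here only as a comment):

```python
import numpy as np

rng = np.random.default_rng(1)
results = np.array([])                    # start with an empty array

for _ in range(10000):                    # repeat the chance experiment
    heads = rng.integers(0, 2, size=100).sum()   # heads in 100 tosses
    results = np.append(results, heads)   # collect one simulated value

# In Data 8 you would now do something like:
# Table().with_column('Heads', results).hist('Heads')
```

(np.append in a loop is slow but mirrors the course's pattern.)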
Correlation coefficient r
A number between minus one and one that measures how strong and how linear the relationship between two numerical variables is.
Cautions about correlation
Correlation can miss nonlinear patterns, can be distorted by outliers, and does not by itself prove cause and effect.
Regression line
The straight line that best fits the scatter of points, used to predict y from x by minimizing squared vertical errors.
Regression prediction
Plug an x value into the regression line to get a predicted y value; predictions for extreme x values are pulled toward the mean of y (regression to the mean).
Residual
For each point, the residual equals actual y minus predicted y, showing how far off the regression line’s prediction is.
Root mean squared error (RMSE)
The typical size of the residuals; a smaller RMSE means the regression line predicts y more accurately.
Properties of residuals
For the best fitting line residuals have an average of zero and show no clear linear pattern with x.
Coefficient of determination (R²)
The fraction of the variation in y that the regression line explains, equal to r squared.
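The regression quantities on the cards above (r, slope, intercept, residuals, RMSE, R²) fit together as follows; the data are invented, nearly linear points:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r = np.corrcoef(x, y)[0, 1]               # correlation coefficient
slope = r * y.std() / x.std()             # least-squares slope
intercept = y.mean() - slope * x.mean()   # line passes through the means
fitted = slope * x + intercept
residuals = y - fitted                    # actual minus predicted
rmse = np.sqrt((residuals ** 2).mean())   # typical prediction error
r_squared = 1 - residuals.var() / y.var() # equals r ** 2
```

For the best-fitting line the residuals average to zero, and R² equals the square of r.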
Regression model (signal + noise)
Think of each y value as the true line value plus some random noise with average zero.
Regression diagnostics
Use residual plots and histograms to check for nonlinearity, changing spread, or non-normal errors in the regression model.
Bootstrap CI for regression prediction
Resample the data, refit the line each time, compute the prediction at a chosen x, and take percentiles of these predictions to form a confidence interval.
Bootstrap CI for slope and slope test
Bootstrap the slope many times, build a confidence interval, and check whether zero lies inside to test the null hypothesis of no linear relationship.
Euclidean distance
The straight-line distance between two points in feature space, found by taking the square root of the sum of squared differences in each feature.
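A sketch of the formula (the function name is illustrative):

```python
import numpy as np

def distance(pt1, pt2):
    # Euclidean distance between two points given as arrays of features.
    return np.sqrt(np.sum((np.array(pt1) - np.array(pt2)) ** 2))

d = distance([0, 0], [3, 4])   # 5.0, the classic 3-4-5 triangle
```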
k-nearest neighbors (k-NN) classifier
To classify a point find its k closest training points and predict the majority class among them.
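A minimal k-NN sketch in plain NumPy (function names and the tiny training set are illustrative, not the course's implementation):

```python
import numpy as np

def knn_classify(test_point, train_points, train_labels, k=3):
    # Predict the majority label among the k nearest training points.
    dists = np.sqrt(((train_points - test_point) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of k closest points
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority class

train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
pred = knn_classify(np.array([0.5, 0.5]), train, labels, k=3)   # 'A'
```

Using an odd k avoids ties when there are two classes.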
Training set vs test set
The training set is used to build the model; the test set is held back and used only to measure how well the model works.
Classifier accuracy
The percentage of examples in the test set that the classifier labels correctly.
Standardizing features for k-NN
Put each feature into standard units so that each feature contributes fairly to the distance, not just the one with the biggest scale.
Prior probability
How likely an event, such as having a disease, is before you see any new evidence.
Posterior probability P(A|B)
The updated chance of event A after seeing event B, combining the prior with how likely B is if A happens.
Tree diagram method
Draw branches for each stage of a random process, multiply along each path to get joint chances, and use them to find conditional probabilities.