Datascience with Python UVA #2

Last updated 8:09 PM on 11/29/25

97 Terms

1

Population

Everyone or everything you care about in a study, even if you cannot collect data from all of them.

2

Sample

The smaller group you actually collect data from, used to learn about the bigger population.

3

Probability sample

A sample picked using random chance in a known way, so that each individual has a known chance of being chosen.

4

Convenience sample

A sample made up of people who are easy to reach, which can be biased and may not represent the population well.

5

With vs without replacement

With replacement, you can pick the same item more than once; without replacement, each item can be picked only once.
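
A minimal sketch of the difference, using only Python's standard library; the item names and the seed are made up for illustration.

```python
import random

rng = random.Random(0)  # seeded so the run is repeatable
items = ["a", "b", "c", "d", "e"]

# With replacement: the same item may appear more than once.
with_repl = rng.choices(items, k=5)

# Without replacement: each item can appear at most once.
without_repl = rng.sample(items, k=5)

print(with_repl)
print(without_repl)
```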

6

Simple random sample (SRS)

A random sample where every possible group of a given size is equally likely to be chosen.

7

Probability distribution

A rule or table that lists all possible outcomes of a random process and how likely each one is.

8

Empirical distribution

The distribution you see in real data: each value and how often it appears in your sample.

9

Law of Averages

If you repeat a random process many times, independently and under the same conditions, the observed proportion of an outcome gets closer to its true probability.
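
One way to see this is to flip a simulated fair coin more and more times. The sketch below uses only the standard library; the function name and seed are illustrative choices.

```python
import random

rng = random.Random(42)

def proportion_of_heads(num_flips):
    """Flip a fair coin num_flips times and return the observed proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The observed proportion tends to settle near 0.5 as the number of flips grows.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```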

10

Parameter

A fixed number that describes the whole population, such as the true average or true proportion; it is usually unknown.

11

Statistic

A number you calculate from your sample, such as the sample mean or sample proportion, used to estimate a parameter.

12

Sampling distribution of a statistic

The distribution of a statistic’s values over all possible random samples from the population.

13

Empirical distribution of a statistic

The histogram of many simulated or resampled values of a statistic, used to approximate its sampling distribution.

14

Chance model

A description of how data are generated by random mechanisms, which we can simulate to see what is typical.

15

Steps to assess a model

Choose a statistic, simulate it many times under the model, plot the simulated values, and compare your actual statistic to that plot.

16

sample_proportions

sample_proportions(n, distribution) draws a random sample of size n from a categorical distribution and returns the sample proportions.
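
The behavior described above can be imitated without the course library. The function below is a standard-library stand-in written for illustration (its name and the `rng` parameter are assumptions), not the library's own implementation.

```python
import random

def sample_proportions_sketch(n, distribution, rng=random):
    """Draw n categories at random according to `distribution` (a list of
    probabilities summing to 1) and return the proportion in each category."""
    categories = range(len(distribution))
    draws = rng.choices(categories, weights=distribution, k=n)
    return [draws.count(c) / n for c in categories]

rng = random.Random(1)
props = sample_proportions_sketch(100, [0.25, 0.5, 0.25], rng)
print(props)  # three proportions that add up to 1
```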

17

Null hypothesis

The “no effect” or “nothing interesting” model that says how the data would look if only chance were operating.

18

Alternative hypothesis

The competing claim that says something real is happening, such as a difference, an effect, or an association.

19

Test statistic

A single number summarizing the data so that large or small values give evidence against the null hypothesis.

20

Empirical null distribution

The distribution of test statistic values you get by simulating data under the null hypothesis many times.

21

Tail area

The part of the null distribution at or beyond your observed test statistic in the direction that supports the alternative.

22

p-value

The chance, assuming the null is true, of getting a test statistic as extreme as or more extreme than what you saw, in the direction of the alternative.
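
As a sketch of how an empirical p-value can be computed by simulation, suppose a coin is tossed 100 times and lands heads 60 times, and the null says the coin is fair. The observed count, the choice of statistic, and the number of repetitions are all hypothetical.

```python
import random

rng = random.Random(7)
num_tosses = 100
observed_heads = 60  # hypothetical observation, for illustration only

def statistic(heads):
    """Distance between the number of heads and what a fair coin expects."""
    return abs(heads - num_tosses / 2)

def simulate_one_statistic():
    """Simulate one set of tosses under the null (fair coin) and return the statistic."""
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return statistic(heads)

simulated = [simulate_one_statistic() for _ in range(5000)]
observed = statistic(observed_heads)

# Empirical p-value: share of simulated statistics at least as extreme as the observed one.
p_value = sum(s >= observed for s in simulated) / len(simulated)
print(p_value)
```

The list `simulated` plays the role of the empirical null distribution, and `p_value` is the tail area at or beyond the observed statistic.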

23

Statistical significance (5% level)

A result is significant at the 5 percent level if its p-value is less than 0.05, meaning a result that extreme would rarely happen by pure chance under the null.

24

Significance level

The cutoff, such as 0.05 or 0.01, that you choose in advance for deciding whether a p-value is small enough to reject the null.

25

Total variation distance (TVD)

A number between 0 and 1 that measures how different two categorical distributions are, computed as half the sum of the absolute differences between their proportions: 0 means identical, 1 means completely different.
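
A short sketch of the calculation, assuming each distribution is given as a list of proportions over the same categories in the same order (the function name is an illustrative choice):

```python
def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two categorical
    distributions given as equal-length lists of proportions."""
    if len(dist1) != len(dist2):
        raise ValueError("distributions must list the same categories")
    return sum(abs(a - b) for a, b in zip(dist1, dist2)) / 2

print(total_variation_distance([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(total_variation_distance([1.0, 0.0], [0.0, 1.0]))  # no overlap at all
```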

26

Permutation test

A test where you shuffle labels such as “treatment” and “control” many times to see what differences could arise just by chance.
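
A minimal sketch of the shuffling idea with the standard library. The outcome values are invented, and the difference of group means is just one possible test statistic.

```python
import random
from statistics import mean

rng = random.Random(3)

# Hypothetical outcomes, for illustration only.
treatment = [12, 15, 14, 16, 13, 17]
control = [10, 11, 13, 9, 12, 11]

def diff_of_means(values, num_treatment):
    """Mean of the first num_treatment values minus the mean of the rest."""
    return mean(values[:num_treatment]) - mean(values[num_treatment:])

pooled = treatment + control
observed = diff_of_means(pooled, len(treatment))

simulated = []
for _ in range(5000):
    shuffled = pooled[:]     # copy so the original order is untouched
    rng.shuffle(shuffled)    # shuffling the values is equivalent to shuffling the labels
    simulated.append(diff_of_means(shuffled, len(treatment)))

# One-sided p-value: the alternative here is that treatment raises the outcome.
p_value = sum(d >= observed for d in simulated) / len(simulated)
print(observed, p_value)
```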

27

Randomized controlled experiment

A study where units are randomly assigned to treatment or control, so differences in outcomes can be interpreted as caused by the treatment.

28

Percentile

The value below which a certain percent of the data fall; for example, the 50th percentile is the median.
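
For a finite data set there are several conventions for picking that value. The sketch below uses one common convention (the smallest value that is at least as large as p percent of the values); other definitions interpolate between values, so results can differ slightly.

```python
import math

def percentile(p, values):
    """Smallest value that is at least as large as p percent of the values
    (one common convention; other definitions interpolate)."""
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered))  # how many values must be at or below
    return ordered[max(k, 1) - 1]

scores = [1, 3, 5, 7, 9]  # hypothetical data
print(percentile(50, scores))  # the median of the five scores
```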

29

Bootstrap sample

A new sample of the same size drawn with replacement from your original sample, treating the sample as if it were the population.

30

Bootstrap principle

If your original sample is large and fairly random, resampling it with replacement mimics taking new samples from the population.

31

Bootstrap distribution

The distribution of a statistic computed from many bootstrap samples, used to estimate its variability.

32

95% bootstrap confidence interval

The interval between the 2.5th and 97.5th percentiles of the bootstrap statistics: a range of plausible values for the parameter.
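
A sketch of the percentile method for a median, using only the standard library. The sample values, the number of repetitions, and the simple index-based percentile cutoffs are assumptions made for illustration.

```python
import random
from statistics import median

rng = random.Random(11)

# Hypothetical sample, for illustration only.
sample = [23, 31, 18, 45, 27, 36, 29, 52, 21, 33, 40, 25]

def bootstrap_statistics(data, statistic, repetitions):
    """Resample `data` with replacement at the same size and collect the statistic."""
    results = []
    for _ in range(repetitions):
        resample = rng.choices(data, k=len(data))
        results.append(statistic(resample))
    return results

boot_medians = sorted(bootstrap_statistics(sample, median, 2000))

# Middle 95%: cut off roughly 2.5% of the bootstrap statistics in each tail.
left = boot_medians[int(0.025 * len(boot_medians))]
right = boot_medians[int(0.975 * len(boot_medians)) - 1]
print(left, right)
```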

33

When bootstrap is unreliable

The bootstrap can fail when the sample is tiny or not random, or when the parameter depends on extreme values like the minimum or maximum.

34

Interpretation of a confidence interval

The method produces intervals that capture the true parameter a certain percent of the time, such as 95 percent; the parameter itself does not move.

35

Using a CI for testing

If the hypothesized value is outside the confidence interval, you reject the null at the matching significance level (for example, a 95% interval corresponds to the 5% level); if it is inside, you do not reject.

36

Distribution of the sample average

The pattern of sample means you would see if you took many random samples of the same size from the population.

37

Central Limit Theorem (CLT)

For large random samples, the distribution of the sample mean is roughly bell shaped and centered at the population mean, whatever the shape of the population distribution.

38

95% CLT confidence interval for mean

Take the sample mean and go about two standard errors up and down: mean plus or minus 2 times (SD divided by square root of n).
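
The arithmetic can be sketched as follows. The sample is invented and kept tiny for readability (a real use of the CLT needs a large random sample), and the SD is computed as the population-style SD of the sample.

```python
import math
from statistics import mean, pstdev

# Hypothetical sample, kept small only so the numbers are easy to follow.
sample = [4, 8, 6, 5, 3, 7, 9, 6, 5, 7, 6, 6, 4, 8, 6, 6]

n = len(sample)
center = mean(sample)
standard_error = pstdev(sample) / math.sqrt(n)  # SD divided by square root of n

# Approximate 95% interval: two standard errors on each side of the sample mean.
interval = (center - 2 * standard_error, center + 2 * standard_error)
print(center, interval)
```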

39

Proportions as 0/1 averages

If you code “yes” as 1 and “no” as 0, then the average of those 0s and 1s equals the proportion of 1s.

40

CI width for a population proportion

The confidence interval for a proportion gets narrower when the sample size increases and when the data are less variable.

41

SD of a 0/1 population

For a proportion p of 1s, the standard deviation is the square root of (p times (1 minus p)), which is largest when p equals 0.5.
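
A quick numerical check of the formula against a directly computed SD, on a made-up population of 0s and 1s:

```python
import math
from statistics import mean, pstdev

def sd_zero_one(p):
    """SD of a population in which a proportion p of the values are 1 and the rest 0."""
    return math.sqrt(p * (1 - p))

population = [1] * 30 + [0] * 70   # hypothetical population with 30% ones
p = mean(population)               # the average of 0s and 1s is the proportion of 1s
print(p, sd_zero_one(p), pstdev(population))

# The SD grows as p moves toward 0.5 and shrinks again beyond it.
for q in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(q, round(sd_zero_one(q), 3))
```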

42

Sample size from desired CI width

To make a confidence interval half as wide you need about four times as many observations.

43

Mean (average)

Add up all the values and divide by the number of values; the mean is sensitive to extreme values.

44

Median

The middle value when your data are sorted; it is less affected by outliers than the mean.

45

Standard deviation (SD)

A number that measures how spread out the data are around the mean; a larger SD means more spread.

46

Chebyshev’s inequality

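In any distribution, no matter what its shape, the proportion of values within z standard deviations of the mean is at least 1 minus 1 over z squared: at least 75 percent within 2 SDs and about 89 percent within 3 SDs.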
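
The bound can be checked on any data set. Here is a sketch on a deliberately skewed, made-up list of values; the function name is an illustrative choice.

```python
from statistics import mean, pstdev

def proportion_within(values, z):
    """Proportion of values within z SDs of the mean."""
    center, spread = mean(values), pstdev(values)
    return sum(abs(v - center) <= z * spread for v in values) / len(values)

# A deliberately skewed, hypothetical data set.
data = [1] * 40 + [2] * 30 + [5] * 20 + [20] * 9 + [100]

for z in (2, 3, 4):
    # Observed proportion, followed by the lower bound 1 - 1/z^2.
    print(z, proportion_within(data, z), ">=", 1 - 1 / z**2)
```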

47

Standard units (z-scores)

A way to measure how far a value is from the mean in standard deviations: (value minus mean) divided by SD.

48

Categorical variable

A variable whose values are groups or labels, such as “red”, “blue”, “yes”, or “no”, instead of numeric measurements.

49

Numerical variable

A variable measured with numbers, where order and differences make sense, such as height, weight, or income.

50

Bar chart

A plot with one bar for each category where bar lengths show how many or what percent fall in each category.

51

Histogram

A plot for numerical data where nearby values are grouped into bins and bar areas show how many observations are in each bin.

52

Bin and bin width

A bin is an interval of values on the number line and its width is how long that interval is.

53

Area principle

In good graphs, the area of shapes matches the quantities they represent, so bigger areas mean bigger values.

54

Histogram height and density

The height of a histogram bar equals the percent in the bin divided by the bin width, showing how crowded the data are in that interval.
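
A small worked example of the height calculation with unequal bins. The data, the bin edges, and the convention that each bin includes its left edge but not its right edge are assumptions for illustration.

```python
def histogram_heights(values, bin_edges):
    """Height of each bar = percent of values in the bin / bin width.
    Bins are [left, right): they include the left edge but not the right."""
    heights = []
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        count = sum(left <= v < right for v in values)
        percent = 100 * count / len(values)
        heights.append(percent / (right - left))
    return heights

ages = [1, 2, 2, 3, 5, 8, 12, 15, 18, 19]  # hypothetical data
edges = [0, 5, 10, 20]                      # unequal bin widths
heights = histogram_heights(ages, edges)
print(heights)  # percent per unit on the horizontal axis
```

The first and last bins each hold 40 percent of the data, but the last bin is twice as wide, so its bar is half as tall; the areas (height times width) still add up to 100 percent.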

55

Bar chart vs histogram

Bar charts show categories with separate bars; histograms show numeric data on a number line, usually with touching bars.

56

Scatterplot

A plot with one point per individual showing two numerical variables on the x and y axes to reveal patterns or relationships.

57

Line plot

A plot where points are connected in order, often used to show how a quantity changes over time.

58

Probability

The long-run fraction of times an event would happen if you repeated the random process many times.

59

Equally likely outcomes rule

If all outcomes are equally likely, an event’s probability is (number of outcomes in the event) divided by (total number of outcomes).

60

Multiplication rule

The chance two events both happen equals the chance the first happens times the chance the second happens given the first.

61

Addition rule

If two events cannot happen at the same time, the chance that one or the other happens is the sum of their probabilities.

62

Complement rule

The chance something does not happen is one minus the chance that it does happen.
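
The four probability rules (equally likely outcomes, multiplication, addition, complement) can each be illustrated with exact fractions. The dice and card scenarios below are standard textbook-style examples chosen for illustration.

```python
from fractions import Fraction

# Equally likely outcomes: an even number on one roll of a fair six-sided die.
outcomes = range(1, 7)
p_even = Fraction(sum(1 for o in outcomes if o % 2 == 0), len(outcomes))

# Multiplication rule: two aces in a row from a 52-card deck, without replacement.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

# Addition rule: rolling a 1 or a 2 (the two events cannot both happen).
p_one_or_two = Fraction(1, 6) + Fraction(1, 6)

# Complement rule: at least one six in four rolls = 1 - P(no six in four rolls).
p_at_least_one_six = 1 - Fraction(5, 6) ** 4

print(p_even, p_two_aces, p_one_or_two, p_at_least_one_six)
```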

63

Table

A Data 8 object with labeled columns and rows where each column is an array representing one variable.

64

Array

A NumPy object holding an ordered sequence of values, usually numbers, on which you can do fast elementwise math.

65

List vs array

Lists are general Python containers that can hold mixed types; arrays hold values of a single type (usually numbers), are faster, and work better with tables and math operations.

66

Table.read_table

Table.read_table('file.csv') loads a data file into a table with one row per record and one column per variable.

67

with_column

table.with_column('New', values) returns a new table with an extra column named New filled with the given array.

68

select and drop

select keeps only the columns you name; drop removes the columns you name from the table.

69

where with conditions

table.where(column, condition) keeps only rows whose values in that column meet the condition, such as are.above(10).

70

sort and take

sort orders rows by a column; take grabs rows by index, such as table.take(0) or table.take(range(10)).

71

group

table.group('Label') counts how many rows fall in each category of Label; given a second argument such as np.average, it instead computes that summary within each group.

72

pivot

pivot makes a table whose rows and columns are categories from two variables, often used for contingency or summary tables.

73

join

table1.join('key', table2, 'key') combines two tables by matching rows that share the same key values.

74

Table.sample

table.sample(n) randomly selects n rows from a table; with_replacement=True (the default) allows the same row to be picked more than once, as the bootstrap requires.

75

Plotting methods

Table methods like hist, barh, scatter, and plot draw common charts directly from columns of data.

76

Simulation pattern

Make an empty array, loop many times to compute a simulated value, append it to the array, then turn the array into a table and plot a histogram.
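
The same pattern can be sketched with the standard library alone, using a plain list in place of an array and a text bar chart in place of a table histogram. The simulated quantity (the sum of two dice) and the repetition count are stand-ins chosen for illustration.

```python
import random
from collections import Counter

rng = random.Random(5)

def one_simulated_value():
    """Sum of two fair dice (a stand-in for any simulated statistic)."""
    return rng.randint(1, 6) + rng.randint(1, 6)

results = []                 # start with an empty collection
for _ in range(10_000):      # repeat many times
    results.append(one_simulated_value())

# Text "histogram": one row per value, bar length proportional to its share.
counts = Counter(results)
for value in sorted(counts):
    print(f"{value:2d} {'#' * (100 * counts[value] // len(results))}")
```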

77

Correlation coefficient r

A number between minus one and one that measures how strong and how linear the relationship between two numerical variables is.
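
One standard way to compute r is as the average of the products of the two variables in standard units. A sketch with the standard library, on made-up data:

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]

def correlation(x, y):
    """r = average of the products of x and y in standard units."""
    products = [a * b for a, b in zip(standard_units(x), standard_units(y))]
    return mean(products)

x = [1, 2, 3, 4, 5]
print(correlation(x, [2, 4, 6, 8, 10]))   # points on a rising line
print(correlation(x, [10, 8, 6, 4, 2]))   # points on a falling line
print(correlation(x, [2, 1, 4, 3, 5]))    # rising trend with scatter
```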

78

Cautions about correlation

Correlation can miss nonlinear patterns, can be distorted by outliers, and does not by itself prove cause and effect.

79

Regression line

The straight line that best fits the scatter of points, used to predict y from x; among all lines, it minimizes the sum of squared vertical errors.
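
For the least-squares line, the slope equals r times (SD of y divided by SD of x) and the intercept equals (mean of y) minus slope times (mean of x). A sketch on made-up data, with illustrative function names:

```python
from statistics import mean, pstdev

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

def regression_line(x, y):
    """Slope and intercept of the least-squares line for predicting y from x."""
    slope = correlation(x, y) * pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

slope, intercept = regression_line(x, y)
print(slope, intercept)
print(slope * 6 + intercept)  # predicted y at x = 6
```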

80

Regression prediction

Plug an x value into the regression line to get a predicted y value; in standard units, the prediction is closer to average than the x value is (regression to the mean).

81

Residual

For each point, the residual equals actual y minus predicted y, showing how far off the regression line’s prediction is.

82

Root mean squared error (RMSE)

The typical size of the residuals, computed as the square root of the mean of the squared residuals; a smaller RMSE means the regression line predicts y more accurately.

83

Properties of residuals

For the best-fitting line, residuals have an average of zero and show no clear linear pattern with x.

84

Coefficient of determination (R²)

The fraction of the variation in y that the regression line explains, equal to r squared.
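
Residuals, RMSE, and R² can all be computed from one fitted line. The sketch below fits the least-squares line directly and then checks that the residuals average to zero; the data are made up for illustration.

```python
import math
from statistics import mean

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Least-squares slope and intercept.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

predicted = [slope * a + intercept for a in x]
residuals = [b - p for b, p in zip(y, predicted)]      # actual y minus predicted y
rmse = math.sqrt(mean([e ** 2 for e in residuals]))    # typical size of a residual

# R squared: one minus (squared residuals) / (squared deviations of y from its mean).
r_squared = 1 - sum(e ** 2 for e in residuals) / sum((b - my) ** 2 for b in y)

print(residuals)
print(mean(residuals), rmse, r_squared)
```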

85

Regression model (signal + noise)

Think of each y value as the true line value plus some random noise with average zero.

86

Regression diagnostics

Use residual plots and histograms to check for nonlinearity, changing spread, or non-normal errors in the regression model.

87

Bootstrap CI for regression prediction

Resample the data, refit the line each time, compute the prediction at a chosen x, and take percentiles of these predictions to form a confidence interval.

88

Bootstrap CI for slope and slope test

Bootstrap the slope many times, build a confidence interval, and see if zero lies inside to test for no linear relationship.
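
A sketch of the slope bootstrap using only the standard library. The (x, y) pairs, the repetition count, and the index-based percentile cutoffs are assumptions; the guard that skips resamples with a single x value is there only because the slope is undefined in that case.

```python
import random
from statistics import mean

rng = random.Random(2)

# Hypothetical (x, y) pairs, for illustration only.
points = [(1, 2.1), (2, 2.9), (3, 3.2), (4, 4.8), (5, 5.1),
          (6, 5.7), (7, 7.4), (8, 7.9), (9, 9.3), (10, 9.8)]

def slope(pairs):
    """Least-squares slope for predicting y from x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)

slopes = []
while len(slopes) < 2000:
    resample = rng.choices(points, k=len(points))   # rows drawn with replacement
    if len({x for x, _ in resample}) > 1:           # skip a resample with only one x value
        slopes.append(slope(resample))

slopes.sort()
left = slopes[int(0.025 * len(slopes))]
right = slopes[int(0.975 * len(slopes)) - 1]
print(left, right)
print("zero inside interval:", left <= 0 <= right)
```

If zero falls outside the interval, the data are inconsistent with a true slope of zero at the corresponding significance level.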

89

Euclidean distance

The straight-line distance between two points in feature space, found by taking the square root of the sum of squared differences in each feature.

90

k-nearest neighbors (k-NN) classifier

To classify a point, find its k closest training points and predict the majority class among them.
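
A compact sketch of a k-NN classifier built on Euclidean distance. The training points, labels, and function names are invented for illustration; an odd k is used so a two-class vote cannot tie.

```python
import math
from collections import Counter

def euclidean_distance(p, q):
    """Square root of the sum of squared differences in each feature."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train_points, train_labels, new_point, k):
    """Majority class among the k training points closest to new_point."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], new_point),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data with two features per point.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_points, train_labels, (2, 2), k=3))
print(knn_classify(train_points, train_labels, (6, 5), k=3))
```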

91

Training set vs test set

The training set is used to build the model; the test set is held back and used only to measure how well the model works.

92

Classifier accuracy

The percentage of examples in the test set that the classifier labels correctly.

93

Standardizing features for k-NN

Put each feature into standard units so that every feature contributes fairly to the distance, not just the one with the biggest scale.

94

Prior probability

How likely an event, such as having a disease, is before you see any new evidence.

95

Posterior probability P(A|B)

The updated chance of event A after seeing event B, combining the prior and how likely B is if A happens.

96

Tree diagram method

Draw branches for each stage of an event, multiply along each path for joint chances, and use them to find conditional probabilities.
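
The tree calculation for a disease test can be sketched with exact fractions. All three input probabilities below are hypothetical numbers chosen for illustration.

```python
from fractions import Fraction

# Hypothetical numbers, for illustration only.
prior_disease = Fraction(1, 100)          # P(disease), the prior
p_pos_given_disease = Fraction(95, 100)   # P(positive | disease)
p_pos_given_healthy = Fraction(5, 100)    # P(positive | no disease)

# Multiply along each path of the tree that ends in a positive test.
p_disease_and_pos = prior_disease * p_pos_given_disease
p_healthy_and_pos = (1 - prior_disease) * p_pos_given_healthy

# Posterior = the disease-and-positive path divided by all positive paths.
posterior = p_disease_and_pos / (p_disease_and_pos + p_healthy_and_pos)
print(posterior, float(posterior))
```

With these inputs the posterior is far larger than the prior but still well below one half, because most positive tests come from the much larger healthy group.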
