DATA1001

167 Terms

New cards

contemporaneous control group

a control group that is studied at the same time as the treatment group

New cards

Simpson's Paradox

when relationships between percentages in subgroups are different from when the subgroups are combined, because of a confounding variable

New cards

placebo

a pretend treatment

New cards

placebo effect

the phenomenon in which the expectations of the participants in a study can influence their behavior

New cards

Why do we do a controlled experiment?

to focus on the effects of the treatment

New cards

reproducible research

a report that can be checked and verified by a third party

New cards

double blind experiment

an experiment where neither the subjects nor the investigators know which subjects received which treatments

New cards

What is the "gold standard" for performing a data study?

randomized controlled trial (RCT)

New cards

observational study

a study based on data in which no manipulation of factors has been employed

New cards

confounding factors

hidden factors, linked to both the explanatory and the response variable, that distort the apparent relationship between them

New cards

head(data, n) in R

returns the first n rows of data

New cards

tail(data, n) in R

returns the last n rows of data

New cards

str(data) in R

returns the structure of data
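The three inspection functions above can be tried on any data frame. A minimal sketch, assuming the built-in `mtcars` data set (chosen only for illustration; it is not part of the course material):

```r
# mtcars ships with base R (datasets package), so nothing needs to be downloaded
data <- mtcars

head(data, 3)   # first 3 rows
tail(data, 3)   # last 3 rows
str(data)       # type and first few values of every column
```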

New cards

qualitative variable

descriptive (categorical) variable: its values are categories rather than numbers

New cards

quantitative variable

numerical variable

New cards

discrete quantitative variable

can take on only particular values and no other values in between

New cards

continuous quantitative variable

can take on any value in a given interval of real numbers

New cards

ordinal qualitative variable

can be ordered or ranked (e.g. months, days of the week)

New cards

nominal qualitative variable

cannot be ordered or ranked (e.g. favorite color)

New cards

binary qualitative variable

only two outcomes

New cards

graph for 1 qualitative variable

simple bar graph

New cards

graph for 2 qualitative variables

double bar graph

New cards

graph for 1 quantitative variable

histogram or boxplot

New cards

graph for 2 quantitative variables

scatterplot

New cards

graph for 1 quantitative variable with multiple categories

comparative boxplot

New cards

height of each block in a histogram

(% in block) / (length of interval)

New cards

IQR

Q3-Q1

New cards

threshold for outliers

upper threshold: Q3 + (1.5 * IQR)

lower threshold: Q1 - (1.5 * IQR)
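The IQR and the two outlier thresholds can be sketched in R as follows. The data are made up for illustration, and R's default `quantile()` method may give slightly different quartiles from a hand calculation on other data sets.

```r
x <- c(2, 4, 5, 5, 6, 7, 8, 9, 30)       # made-up data; 30 sits far from the rest

q1 <- quantile(x, 0.25, names = FALSE)   # first quartile
q3 <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                           # same value as IQR(x)

lower <- q1 - 1.5 * iqr                  # lower threshold
upper <- q3 + 1.5 * iqr                  # upper threshold

outliers <- x[x < lower | x > upper]     # points beyond either threshold
outliers
```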

New cards

mean

the average: (sum of the data) / (number of data points)

New cards

median

middle data point of an ordered dataset (the average of the two middle points when the number of points is even)

New cards

Is the mean robust?

No - easily affected by outliers. Good for symmetric data with not a lot of outliers.

New cards

Is the median robust?

Yes - not easily affected by outliers. Good for skewed data.
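A small illustration of robustness, using made-up numbers: adding one extreme value drags the mean a long way but barely moves the median.

```r
x <- c(10, 12, 13, 14, 16)   # made-up, roughly symmetric data
x_out <- c(x, 200)           # the same data plus one extreme value

mean(x)                      # 13
median(x)                    # 13

mean(x_out)                  # pulled up to about 44.2 by the single outlier
median(x_out)                # 13.5, almost unchanged
```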

New cards

What should the mean and median be paired with?

they should be paired with a measure of spread

New cards

standard deviation

roughly the average distance of the observations of a variable from the mean (precisely, the root-mean-square of the gaps from the mean)

New cards

standard deviation formula (population)

SDpop = sqrt( sum of (x - mean)^2 / n )

New cards

standard deviation formula (sample)

SDsam = sqrt( sum of (x - mean)^2 / (n - 1) )

New cards

normal curve

- 68% within 1 std. dev

- 95% within 2 std. dev

- 99.7% within 3 std. dev
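These three percentages can be checked with `pnorm`, which gives the area under the standard normal curve to the left of a value. A minimal sketch:

```r
k <- 1:3                          # number of SDs either side of the mean
area <- pnorm(k) - pnorm(-k)      # area between -k and +k on the standard normal curve

round(100 * area, 1)              # 68.3 95.4 99.7
```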

New cards

standard units (z-score)

how many standard deviations a data point is away from the mean

New cards

z-score formula

z = (data point - mean) / SD

New cards

coefficient of variation

CV = SD / mean

New cards

dim(data) in R

returns the dimensions (rows and columns) of data

New cards

if we have data size n with sample SD, how do you get the population SD?

SDpop = SDsam * sqrt((n - 1) / n)

New cards

sd(data) in R

returns the sample SD of data

New cards

popsd(data) in R

returns the population SD of data (requires rafalib)

New cards

steps to calculate RMS

read the name in reverse: square the numbers, take the mean of the squares, then take the square root

New cards

How does population SD relate to RMS?

SDpop = RMS of (gaps from the mean)
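The link between RMS, the population SD and `sd()` can be sketched in base R. The data are made up, and `rms` is a hypothetical helper written for this example, not a built-in function.

```r
rms <- function(v) sqrt(mean(v^2))            # square, then mean, then root

x <- c(2, 4, 4, 4, 5, 5, 7, 9)                # made-up data with mean 5
n <- length(x)

sd_pop_rms <- rms(x - mean(x))                # population SD as RMS of the gaps: 2
sd_pop_scaled <- sd(x) * sqrt((n - 1) / n)    # sample SD rescaled to the population SD

c(rms = sd_pop_rms, scaled = sd_pop_scaled, sample = sd(x))
```

Both routes give the same population SD, while `sd(x)` on its own is slightly larger because it divides by n - 1.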

New cards

standard normal curve

normal distribution with mean 0 and SD 1

New cards

general normal curve

can have any mean and SD

New cards

What goes into an individual measurement?

individual measurement = exact value + chance error + bias

New cards

chance error

small differences in sampling due to chance

New cards

How to estimate chance error

Replicate measurements and calculate SD

New cards

bias (systematic) error

a constant error added to/subtracted from each measurement; can be deliberate or accidental

New cards

For a Normal curve with mean 10 and standard deviation 4, what percentage of the data lie between 10 and 14?

34%
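The 34% comes from the 68% rule (14 is one SD above the mean of 10, so half of 68%). A quick check in R:

```r
# Area under the normal curve with mean 10 and SD 4, between 10 and 14
area <- pnorm(14, 10, 4) - pnorm(10, 10, 4)
area   # about 0.3413
```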

New cards

qnorm(n, x, SD) in R

returns the value below which a proportion n (between 0 and 1) of a normal distribution with mean x and standard deviation SD lies, i.e. the (100 * n)th percentile (default values for x and SD are 0 and 1)

New cards

pnorm(n, x, SD) in R

returns the area under the normal curve (with mean x and standard deviation SD) to the left of n (default values for x and SD are 0 and 1)
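`qnorm` and `pnorm` undo each other: one maps an area to a value, the other maps a value back to an area. A minimal sketch, with the mean of 10 and SD of 4 chosen only for illustration:

```r
p <- 0.975

z <- qnorm(p)          # about 1.96: 97.5% of the standard normal curve lies below it
pnorm(z)               # back to 0.975

x <- qnorm(p, 10, 4)   # same percentile on a curve with mean 10 and SD 4
pnorm(x, 10, 4)        # 0.975 again

x - (10 + 4 * z)       # essentially 0: x is z standard deviations above the mean
```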

New cards

notation for general normal curve

X ~ N(x, v) where x is the mean and v is the variance (SD squared)

New cards

Assuming we have zero bias, when we take a set of measurements in the real world, we can think of each individual measurement as equal to

exact value + chance error

New cards

What code would work out the area under a general normal curve, with mean = 4 and SD = 5, below 0?

pnorm(0,4,5)

New cards

bivariate data

data with two variables

New cards

explanatory variable

a variable that we think explains or causes changes in the response variable, i.e. independent variable (x)

New cards

response variable

a variable that measures an outcome or result of a study, i.e. dependent variable (y)

New cards

linear association

how tightly the points cluster around a line

New cards

strong association

points cluster tightly around the line

New cards

weak association

points are only loosely scattered around the line

New cards

positive association

as x increases, y increases

New cards

negative association

as x increases, y decreases

New cards

correlation coefficient (r)

A numerical index of the strength and direction of the linear relationship between two variables; between -1 and 1.

New cards

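correlation coefficient formula (population)

r = sum of (Zx * Zy) / n, where Zx and Zy are the standard units of x and y calculated with population SDs; for sample r, use sample SDs and a denominator of n - 1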

New cards

slope of the SD line

SDy/SDx (taken as negative when r is negative)

New cards

slope of the regression line

r(SDy/SDx)
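The regression slope formula can be checked against R's `lm()`. A minimal sketch with made-up data; the intercept line uses the fact that the regression line passes through the point of averages.

```r
x <- c(1, 2, 3, 4, 5)                     # made-up data
y <- c(2, 1, 4, 3, 6)

r <- cor(x, y)
slope <- r * sd(y) / sd(x)                # the n - 1 factors cancel, so sample SDs work
intercept <- mean(y) - slope * mean(x)    # line passes through (mean of x, mean of y)

c(intercept = intercept, slope = slope)
coef(lm(y ~ x))                           # fitted line from lm() for comparison
```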

New cards

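steps to predict the percentile of y from the percentile of x

1. Convert the percentile of x to a z-score: Zx = qnorm(percentile), with the percentile written as a proportion (e.g. 0.9)

2. Predict the z-score of y: Zy = r * Zx

3. Convert Zy back to a percentile: pnorm(Zy)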
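The three steps can be sketched in R. The correlation of 0.6 and the 90th percentile are hypothetical values, and the method assumes both variables roughly follow a normal curve.

```r
r <- 0.6                  # hypothetical correlation between x and y
pct_x <- 0.90             # x sits at the 90th percentile

z_x <- qnorm(pct_x)       # step 1: percentile of x to standard units
z_y <- r * z_x            # step 2: predicted standard units of y
pct_y <- pnorm(z_y)       # step 3: standard units of y back to a percentile

round(100 * pct_y)        # predicted percentile of y, about 78
```

The predicted percentile is closer to the 50th than the starting percentile was, which is the regression effect.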

New cards

residual

the vertical distance of a point above/below a regression line; represents error between actual value and prediction

New cards

RMS error

average gap between the points and regression line

New cards

RMS error formula

RMS error = sqrt( mean of (residuals^2) ) = sqrt(1 - r^2) * SDy

New cards

If a linear model is appropriate, what should the residual plot show?

no pattern

New cards

homoscedastic data

the vertical spread of the data is similar over the values of x

New cards

heteroscedastic data

the vertical spread of the data is unequal over the values of x

New cards

normal approximation within a vertical strip

new mean = mean of y + (Zx * r * SDy)

new SD = RMS error
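A sketch of the vertical-strip calculation in R. Every summary value below is hypothetical, and the RMS error is computed with the standard shortcut sqrt(1 - r^2) * SDy, which is an assumption of this example rather than something stated on the card.

```r
# All summary values below are hypothetical
mean_x <- 170; sd_x <- 10
mean_y <- 70;  sd_y <- 12
r <- 0.5

x0 <- 180                                # the vertical strip of interest
z_x <- (x0 - mean_x) / sd_x              # 1 standard unit above the mean of x

strip_mean <- mean_y + z_x * r * sd_y    # 76
strip_sd <- sqrt(1 - r^2) * sd_y         # RMS error, about 10.4

# Estimated proportion of points in the strip with y above 80
1 - pnorm(80, strip_mean, strip_sd)
```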

New cards

prosecutor's fallacy

the mistake of assuming that the probability of a random match equals the probability that the defendant is innocent

New cards

conditional probability

the probability that one event happens given that another event is already known to have happened: P(A | B) (probability of event A given event B)

New cards

independent events

The outcome of one event does not change the probability of the other event: P(B | A) = P(B)

New cards

dependent events

The outcome of one event changes the probability of the other event: P(B | A) is not equal to P(B)

New cards

probability of A AND B

P(A) x P(B | A)

New cards

mutually exclusive events

events that cannot happen at the same time

New cards

probability of A or B if mutually exclusive

P(A) + P(B)

New cards

probability of A or B if not mutually exclusive

P(A) + P(B) - P(A and B)
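The addition rule can be verified by listing outcomes. A minimal sketch for one roll of a fair die, where the events A (even number) and B (greater than 3) are chosen only for illustration:

```r
die <- 1:6                 # equally likely outcomes of one roll
A <- die %% 2 == 0         # event A: roll is even (2, 4, 6)
B <- die > 3               # event B: roll is greater than 3 (4, 5, 6)

p_or_direct <- mean(A | B)                       # count outcomes in A or B directly
p_or_rule <- mean(A) + mean(B) - mean(A & B)     # addition rule

c(direct = p_or_direct, rule = p_or_rule)        # both equal 2/3
```

Subtracting P(A and B) removes the outcomes 4 and 6, which would otherwise be counted twice.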

New cards

How many different ways to arrange a set of n objects?

n! = n * (n - 1) * ... * 2 * 1

New cards

If you have a set of n objects and randomly select x objects from the set, how many possible combinations are there?

n choose x = n! / (x! * (n - x)!)
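Both counts are available in base R through `factorial()` and `choose()`. A short sketch with small illustrative numbers:

```r
n <- 5
x <- 2

factorial(n)                                          # 120 ways to arrange 5 objects
choose(n, x)                                          # 10 ways to pick 2 objects out of 5

factorial(n) / (factorial(x) * factorial(n - x))      # same 10, from the formula
```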

New cards

binomial model

Predicting the number of successes in a fixed number of binary trials

New cards

If you have n binary trials, with one event having p chance to occur, what is the probability that the event occurs exactly x times?

binomial formula: P(exactly x successes) = (n choose x) * p^x * (1 - p)^(n - x)
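The chance of exactly x successes in n binary trials can be computed by hand or with R's `dbinom()`. A minimal sketch, assuming 10 tosses of a fair coin as the example:

```r
n <- 10      # number of trials
x <- 3       # number of successes of interest
p <- 0.5     # chance of success on each trial

by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)   # binomial formula written out
built_in <- dbinom(x, size = n, prob = p)         # same probability from R

c(by_hand = by_hand, built_in = built_in)         # both 0.1171875
```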

New cards

law of large numbers

as the length of a simulation increases (i.e. sample size increases), the proportion of a certain event approaches a fixed relative frequency

New cards

As sample size increases, does the relative error increase?

No - the absolute error will increase, but the error relative to sample size will decrease
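A small simulation can illustrate this. The sketch below tosses a fair coin 100, 10,000 and 1,000,000 times; the seed is arbitrary and only makes the run repeatable. In a typical run the absolute error tends to grow with the number of tosses while the relative error shrinks, although any single run can deviate.

```r
set.seed(1)                                   # arbitrary seed, only for repeatability
n <- c(100, 10000, 1000000)                   # numbers of fair-coin tosses
heads <- rbinom(length(n), size = n, prob = 0.5)

abs_error <- abs(heads - n / 2)               # gap between observed and expected count
rel_error <- abs_error / n                    # the same gap as a fraction of the tosses

data.frame(n, heads, abs_error, rel_error)
```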

New cards

box model

a model for describing chance processes: the outcome is treated as random draws of tickets from a box

New cards

expected value for sample sum

EV = n * mean of box (n is the sample size)

New cards

standard error for sample sum

SE = SDbox * sqrt(n)

New cards

expected value for sample mean

EV = mean of box

New cards

standard error for sample mean

SE = SDbox / sqrt(n)

New cards

SD of a binary box

SD = (x - y) * sqrt(proportion of x * proportion of y), where x is the larger ticket value and y is the smaller
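The box-model formulas can be compared with a simulation. A minimal sketch, assuming a hypothetical binary box with three 0 tickets and one 1 ticket and 25 draws with replacement; the seed and the number of repetitions are arbitrary.

```r
box <- c(0, 0, 0, 1)                          # hypothetical binary box
n <- 25                                       # number of draws, with replacement

mean_box <- mean(box)                         # 0.25
sd_box <- sqrt(mean((box - mean_box)^2))      # population SD of the box
sd_shortcut <- (1 - 0) * sqrt(0.25 * 0.75)    # binary-box shortcut, same value

ev_sum <- n * mean_box                        # expected value of the sample sum
se_sum <- sd_box * sqrt(n)                    # standard error of the sample sum
se_mean <- sd_box / sqrt(n)                   # standard error of the sample mean

set.seed(1)                                   # arbitrary seed, only for repeatability
sums <- replicate(10000, sum(sample(box, n, replace = TRUE)))

c(ev_sum = ev_sum, simulated_mean = mean(sums))
c(se_sum = se_sum, simulated_sd = sd(sums))
```

The simulated mean and SD of the sums should land close to the EV and SE from the formulas.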

New cards

central limit theorem

The theorem that, as the sample size n increases, the distribution of the sample mean (or sum) of random draws approaches a normal curve, whatever the shape of the box (rule of thumb: n > 30)

New cards

continuity correction

Adjustment made when a discrete random variable is approximated by a continuous one: each whole-number value is extended by 0.5 in either direction
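The effect of the correction can be seen by approximating a binomial chance with a normal curve. A minimal sketch, assuming 100 tosses of a fair coin and the chance of at most 55 heads:

```r
n <- 100; p <- 0.5
ev <- n * p                        # expected number of heads: 50
se <- sqrt(n * p * (1 - p))        # standard error of the count: 5

exact <- pbinom(55, n, p)          # exact chance of 55 or fewer heads
with_cc <- pnorm(55.5, ev, se)     # normal approximation, extended by 0.5
without_cc <- pnorm(55, ev, se)    # normal approximation with no correction

c(exact = exact, with_cc = with_cc, without_cc = without_cc)
```

The corrected approximation is much closer to the exact binomial value than the uncorrected one.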