DATA1001

167 Terms

1

contemporaneous control group

a control group that is studied at the same time as the treatment group

2

Simpson's Paradox

when a relationship between percentages seen within each subgroup changes or reverses once the subgroups are combined, because of a confounding variable

3

placebo

a pretend treatment

4

placebo effect

the phenomenon in which participants' expectations about a treatment influence their behavior or outcomes, even when the treatment is a pretend one

5

Why do we do a controlled experiment?

to focus on the effects of the treatment

6

reproducible research

a report that can be checked and verified by a third party

7

double blind experiment

an experiment where neither the subjects nor the investigators know which subjects received which treatments

8

What is the "gold standard" for performing a data study?

randomized controlled trial (RCT)

9

observational study

a study based on data in which no manipulation of factors has been employed

10

confounding factors

hidden factors that influence the results

11

head(data, n) in R

returns the first n rows of data

12

tail(data, n) in R

returns the last n rows of data

13

str(data) in R

returns the structure of data
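
As a quick illustration of the three inspection functions above, the sketch below runs them on `mtcars`, a data frame that ships with R. The choice of dataset is an assumption made for illustration and is not part of the original cards.

```r
# mtcars is a built-in R data frame, used here only as example data
first_rows <- head(mtcars, 3)   # first 3 rows
last_rows  <- tail(mtcars, 3)   # last 3 rows

print(first_rows)
print(last_rows)

# str() prints a compact summary: number of observations,
# number of variables, and the type of each column
str(mtcars)
```

If `n` is left out, `head()` and `tail()` return 6 rows by default.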

14

qualitative variable

a variable that records a category or description (categorical)

15

quantitative variable

numerical variable

16

discrete quantitative variable

can take on only particular values and no other values in between

17

continuous quantitative variable

can take on any value in a given interval of real numbers

18

ordinal qualitative variable

can be ordered or ranked (e.g. months, days of the week)

19

nominal qualitative variable

cannot be ordered or ranked (e.g. favorite color)

20

binary qualitative variable

only two outcomes

21

graph for 1 qualitative variable

simple bar graph

22

graph for 2 qualitative variables

double bar graph

23

graph for 1 quantitative variable

histogram or boxplot

24

graph for 2 quantitative variables

scatterplot

25

graph for 1 quantitative variable with multiple categories

comparative boxplot

26

height of each block in a histogram

(% in block) / (length of interval)

27

IQR

Q3-Q1

28

threshold for outliers

upper threshold: Q3 + (1.5 * IQR)

lower threshold: Q1 - (1.5 * IQR)
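
The IQR and the two outlier thresholds can be sketched in R as follows. The data vector is hypothetical, and `quantile()` uses R's default quartile method, which may differ slightly from a hand calculation taught in class.

```r
# Hypothetical data, chosen so that one value sits far from the rest
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1                  # same value as IQR(x)

lower <- q1 - 1.5 * iqr         # lower threshold
upper <- q3 + 1.5 * iqr         # upper threshold

# Values outside the thresholds are flagged as outliers
outliers <- x[x < lower | x > upper]

print(c(lower = lower, upper = upper))
print(outliers)
```

With this vector, only the value 30 falls outside the thresholds.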

29

mean

the average: (sum of the data) / (number of data points)

30

median

middle data point of an ordered dataset

31

Is the mean robust?

No - easily affected by outliers. Good for symmetric data with not a lot of outliers.

32

Is the median robust?

Yes - not easily affected by outliers. Good for skewed data.

33

What should the mean and median be paired with?

they should be paired with a measure of spread

34

standard deviation

average distance of all the observations of a variable from the mean

35

standard deviation formula (population)

$SD_{pop} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

36

standard deviation formula (sample)

$SD_{sample} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

37

normal curve

- about 68% of the data lie within 1 SD of the mean

- about 95% within 2 SDs

- about 99.7% within 3 SDs

38

standard units (z-score)

how many standard deviations a data point is away from the mean

39

z-score formula

$z = \frac{x - \text{mean}}{SD}$

40

coefficient of variation

$CV = \frac{SD}{\text{mean}}$

41

dim(data) in R

returns the dimensions (rows and columns) of data

42

if we have data size n with sample SD, how do you get the population SD?

SDpop = SDsam * sqrt((n - 1) / n)

43

sd(data) in R

returns the sample SD of data

44

popsd(data) in R

returns the population SD of data (requires rafalib)

45

steps to calculate RMS

square the numbers, take the mean of the squares, then take the square root (the name read in reverse)

46

How does population SD relate to RMS?

SDpop = RMS of (gaps from the mean)
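
The link between the sample SD, the population SD, and RMS can be checked with a minimal sketch in base R. The data vector and the helper name `rms` are hypothetical, and the sketch avoids `rafalib` so it runs without extra packages.

```r
# Hypothetical data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# RMS: square, then mean, then root
rms <- function(v) sqrt(mean(v^2))

sd_sample  <- sd(x)                           # divides by n - 1
sd_pop     <- sd_sample * sqrt((n - 1) / n)   # convert sample SD to population SD
sd_pop_rms <- rms(x - mean(x))                # RMS of the gaps from the mean

print(c(sd_sample = sd_sample, sd_pop = sd_pop, sd_pop_rms = sd_pop_rms))
```

Both routes to the population SD give 2 for this vector, while the sample SD is slightly larger.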

47

standard normal curve

normal distribution with mean 0 and SD 1

48

general normal curve

can have any mean and SD

49

What goes into an individual measurement?

individual measurement = exact value + chance error + bias

50

chance error

small differences in sampling due to chance

51

How to estimate chance error

Replicate measurements and calculate SD

52

bias (systematic) error

a constant error added to/subtracted from each measurement; can be deliberate or accidental

53

For a Normal curve with mean 10 and standard deviation 4, what percentage of the data lie between 10 and 14?

34%
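
The 34% comes from the 68% rule: 14 is one SD above the mean of 10, and half of 68% lies on that side. A short check in R, using `pnorm()` with the mean and SD from the question:

```r
# Area under the normal curve with mean 10 and SD 4, between 10 and 14
area <- pnorm(14, mean = 10, sd = 4) - pnorm(10, mean = 10, sd = 4)

print(round(100 * area, 1))   # about 34.1
```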

54

qnorm(n, x, SD) in R

returns the value below which a proportion n of the data lie (the 100n-th percentile) for a normal distribution with mean x and standard deviation SD; n must be between 0 and 1, and x and SD default to 0 and 1

55

pnorm(n, x, SD) in R

returns the area under the normal curve (with mean x and standard deviation SD) to the left of n; x and SD default to 0 and 1

56

notation for general normal curve

X ~ N(x, v) where x is the mean and v is the variance (SD squared)

57

Assuming we have zero bias, when we take a set of measurements in the real world, we can think of each individual measurement as equal to

exact value + chance error

58

What R code gives the area under a general normal curve with mean 4 and SD 5, from 0 down (to the left of 0)?

pnorm(0,4,5)
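
A minimal sketch of how this call relates to z-scores and to `qnorm()`. The variable names are chosen for illustration only.

```r
# Area to the left of 0 under a normal curve with mean 4 and SD 5
left_area <- pnorm(0, 4, 5)

# Same area via standard units: z = (0 - 4) / 5 = -0.8,
# then look it up on the standard normal curve
z <- (0 - 4) / 5
left_area_z <- pnorm(z)

# qnorm() undoes pnorm(): it returns the value with the given area to its left
back <- qnorm(left_area, 4, 5)

print(c(left_area = left_area, left_area_z = left_area_z, back = back))
```

The area is roughly 0.21, and feeding it back into `qnorm()` recovers the original cut-off of 0.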

59

bivariate data

data with two variables

60

explanatory variable

a variable that we think explains or causes changes in the response variable, i.e. independent variable (x)

61

response variable

a variable that measures an outcome or result of a study, i.e. dependent variable (y)

62

linear association

how tightly the points cluster around a line

63

strong association

points cluster tightly around a line

64

weak association

points are scattered loosely around a line

65

positive association

as x increases, y increases

66

negative association

as x increases, y decreases

67

correlation coefficient (r)

a numerical measure of the strength and direction of the linear relationship between two variables; always between -1 and 1

68

correlation coefficient formula (population)

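$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{SD_x}\right)\left(\frac{y_i - \bar{y}}{SD_y}\right)$, using population SDs

for sample r, the denominator is n - 1 (and the SDs are sample SDs)
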
69

slope of the SD line

SDy/SDx if r is positive, -SDy/SDx if r is negative

70

slope of the regression line

r(SDy/SDx)
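
The two slopes can be compared in a short sketch that cross-checks the regression slope against R's least-squares fit from `lm()`. The `mtcars` data and the choice of `wt` and `mpg` as x and y are assumptions made for illustration.

```r
# mtcars ships with R; using weight (wt) to predict fuel economy (mpg)
# is an illustrative choice only
x <- mtcars$wt
y <- mtcars$mpg

r <- cor(x, y)
slope_sd_line    <- sign(r) * sd(y) / sd(x)   # SD line: sign of r times SDy/SDx
slope_regression <- r * sd(y) / sd(x)         # regression line: r * (SDy/SDx)

# Cross-check against the least-squares fit
slope_lm <- unname(coef(lm(y ~ x))[2])

print(c(r = r, sd_line = slope_sd_line,
        regression = slope_regression, lm = slope_lm))
```

Because r lies between -1 and 1, the regression line is never steeper than the SD line. The ratio SDy/SDx is the same whether sample or population SDs are used, as long as both are of the same kind.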

71

steps to predict the percentile of y from the percentile of x

1. Find z-score of x: Zx = qnorm(percentile)

2. Translate z-score to y: Zy = r * Zx

3. Translate Zy to percentile: pnorm(Zy)
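
The three steps above can be sketched in R as follows. The percentile (90th) and the correlation (0.6) are hypothetical inputs, and the method assumes both variables are roughly normal.

```r
# Hypothetical inputs: x is at the 90th percentile and r = 0.6
percentile_x <- 0.90
r <- 0.6

z_x <- qnorm(percentile_x)    # step 1: percentile of x -> z-score of x
z_y <- r * z_x                # step 2: regression prediction in standard units
percentile_y <- pnorm(z_y)    # step 3: z-score of y -> percentile of y

print(round(c(z_x = z_x, z_y = z_y, percentile_y = percentile_y), 3))
```

The predicted percentile of y (roughly the 78th here) is closer to the 50th than the percentile of x, which is the regression effect.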

72

residual

the vertical distance of a point above/below a regression line; represents error between actual value and prediction

73

RMS error

average gap between the points and regression line

74

RMS error formula

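$\text{RMS error} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2} = \sqrt{1 - r^2} \times SD_y$, where $e_i$ is the residual of point $i$ and $SD_y$ is the population SD of y
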
75

If a linear model is appropriate, what should the residual plot show?

no pattern (a random scatter of points around zero)

76

homoscedastic data

the vertical spread of the data is similar over the values of x

77

heteroscedastic data

the vertical spread of the data is unequal over the values of x

78

normal approximation within a vertical strip

mean of y in the strip = mean of y + (Zx * r * SDy)

SD of y in the strip = RMS error

79

prosecutor's fallacy

the mistake of assuming that the probability of a random match equals the probability that the defendant is innocent

80

conditional probability

the probability that one event happens given that another event is already known to have happened: P(A | B) (probability of event A given event B)

81

independent events

The outcome of one event does not affect the outcome of the second event

82

dependent events

The outcome of one event does affect the outcome of the second event

83

probability of A AND B

P(A) x P(B | A)

84

mutually exclusive events

events that cannot happen at the same time

85

probability of A or B if mutually exclusive

P(A) + P(B)

86

probability of A or B if not mutually exclusive

P(A) + P(B) - P(A and B)

87

How many different ways to arrange a set of n objects?

n!

88

If you have a set of n objects and randomly select x objects from the set, how many possible combinations are there?

n choose x: $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$

89

binomial model

Predicting the number of successes in a fixed number of binary trials

90

If you have n binary trials, with one event having p chance to occur, what is the probability that the event occurs exactly x times?

binomial formula: $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$

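The counting rules and the binomial probability can be checked with a minimal sketch in R. The coin-toss numbers are hypothetical; `factorial()`, `choose()` and `dbinom()` are base R functions.

```r
# Hypothetical example: 10 tosses of a fair coin, exactly 3 heads
n <- 10
x <- 3
p <- 0.5

arrangements <- factorial(4)   # ways to arrange 4 objects: 4! = 24
combos <- choose(n, x)         # n choose x = 120

# Probability of exactly x successes, written out by hand
prob_by_formula <- combos * p^x * (1 - p)^(n - x)

# Same probability from R's built-in binomial function
prob_by_dbinom <- dbinom(x, size = n, prob = p)

print(c(formula = prob_by_formula, dbinom = prob_by_dbinom))
```

Both approaches give 120/1024, or about 0.117.
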
91

law of large numbers

as the length of a simulation increases (i.e. sample size increases), the proportion of a certain event approaches a fixed relative frequency

92

As sample size increases, does the relative error increase?

No - the absolute error will increase, but the error relative to sample size will decrease

93

box model

a model for describing chance processes

94

expected value for sample sum

EV = n * mean of box (n is the sample size)

95

standard error for sample sum

SE = SDbox * sqrt(n)

96

expected value for sample mean

EV = mean of box

97

standard error for sample mean

SE = SDbox / sqrt(n)

98

SD of a binary box

SD = (x - y) * sqrt(proportion of x * proportion of y), where x is the larger value and y the smaller value in the box
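
The box-model formulas for the sample sum and sample mean can be sketched in R as follows. The box contents and the number of draws are hypothetical, and the draws are assumed to be made with replacement.

```r
# Hypothetical binary box: 1 ticket marked 1 and 3 tickets marked 0
box <- c(1, 0, 0, 0)
n <- 100                                     # number of draws

mean_box <- mean(box)
sd_box   <- sqrt(mean((box - mean_box)^2))   # population SD of the box

# Shortcut for a binary box:
# (bigger - smaller) * sqrt(proportion of bigger * proportion of smaller)
sd_shortcut <- (1 - 0) * sqrt(0.25 * 0.75)

ev_sum  <- n * mean_box         # expected value of the sample sum
se_sum  <- sqrt(n) * sd_box     # standard error of the sample sum
ev_mean <- mean_box             # expected value of the sample mean
se_mean <- sd_box / sqrt(n)     # standard error of the sample mean

print(c(ev_sum = ev_sum, se_sum = se_sum, ev_mean = ev_mean, se_mean = se_mean))
```

The SE of the sum grows with n while the SE of the mean shrinks, which matches the earlier card on absolute versus relative error.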

99

central limit theorem

as the sample size n increases, the distribution of the means of random samples of size n approaches a normal distribution (rule of thumb: n > 30)

100

continuity correction

an adjustment of 0.5 made when a discrete random variable is approximated by a continuous one, e.g. P(X <= 55) is approximated by the area under the curve up to 55.5
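
The effect of the correction can be seen in a minimal sketch that compares a normal approximation with the exact binomial probability. The coin-toss setup is hypothetical; `pbinom()` gives the exact value.

```r
# Hypothetical example: X = number of heads in 100 tosses of a fair coin.
# Target: P(X <= 55)
n <- 100
p <- 0.5
ev <- n * p                          # expected value of the sum: 50
se <- sqrt(n) * sqrt(p * (1 - p))    # standard error of the sum: 5

exact           <- pbinom(55, size = n, prob = p)
approx_plain    <- pnorm(55, ev, se)         # no correction
approx_adjusted <- pnorm(55 + 0.5, ev, se)   # continuity correction: area up to 55.5

print(round(c(exact = exact, plain = approx_plain, adjusted = approx_adjusted), 4))
```

In this example the corrected approximation lands closer to the exact probability than the uncorrected one.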