STAT1070: Statistics

106 Terms

1

Continuous variables

Numerical values that are measured rather than counted and can take any value within a range, e.g. percentages, fractions, times. Time and age are treated as continuous variables.

2

Discrete variables

Numerical values that are counted rather than measured, so they take only distinct (usually whole-number) values, e.g. number of people (you can’t have half a person).

3

Nominal variables

Categorical variables that have no natural order. This also includes numerical data that acts as a symbol, e.g. post codes or coded variables (1 = yes, 2 = no).

4

Ordinal variables

Categorical variables with a natural order or scale but not with equal intervals e.g. grades.

5

Population

The entire group that the researcher is interested in and from which the sample is selected.

6

Sample

Selection/subset of data (ideally representative) from the population of interest

7

Parameters

Property of the population that you infer using a statistic. μ (mean), X (observation), σ (SD), P (proportion) and N (size)

8

Statistics

Property of the sample that you use to infer parameters. x̄ (mean), n (size), x (observation), s (SD), p-hat (proportion).

9

Graph for ordinal data.

Bar charts (categories on x axis)

10

Graph for nominal data

Pareto chart (line = cumulative percentage) or bar chart for few categories

11

Graph for continuous data

Histogram or box plot

12

Graph for discrete data

Bar chart (few outcomes) or histogram (many outcomes or the data is sparse)

13

Graph for two categorical variables

Clustered bar chart (frequencies) or stacked bar chart (proportions) helps with seeing relative difference

14

P-value

The probability of obtaining a result equal to or more extreme than the observed test statistic if the null hypothesis were true, i.e. how likely a difference at least this large would be from chance alone. The test statistic itself measures how far the data are from what the null hypothesis predicts.

15

Correlation

A relationship between two variables, measured by the correlation coefficient, which ranges from -1 to +1. Its size (0 to 1) gives the strength and its sign (-/+) gives the direction of the linear relationship, i.e. the extent to which changes in one variable are related to changes in another.

16

Coefficient of determination (R²)

Measure of proportion of variance in the dependent variable that can be explained by the independent variable in a regression model. Represents the goodness-of-fit of the regression model and ranges from 0 to 1 with higher = better fit.

17

Inter Quartile Range (IQR)

Measures the spread of the middle half (50%) of your distribution excluding the outliers. Found by measuring the range between the first quartile (lower) and the third quartile (upper) (Q3 - Q1)

18

Large test Statistic

It is less likely that your data could have occurred under the null hypothesis

19

Mean (measure of central tendency)

The average, found by adding all the data points together and dividing by the number of data points

20

Median (measure of central tendency)

The middle value of the data when sorted into ascending/descending order. Divides the data into two equal halves, with 50% of observations above and 50% below the median.

21

Non-parametric test

Statistical test that doesn't rely on specific assumptions about distribution of data. Used when data doesn't meet assumptions required for parametric tests. Based on ranks or ordering of data and suitable for analysing categorical or ordinal data.

22

Power (1 - β)

Probability of correctly rejecting a false null hypothesis. Measures the ability of a statistical test/study to detect a true effect or relationship. Higher power = higher likelihood of detecting a true effect

23

Regression Slope (coefficient or beta coefficient)

Measure of the change in the dependent variable associated with a one-unit change in the independent variable in a regression model. Represents the slope of the regression line and the size/direction of the relationship between variables
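A minimal sketch of how the slope, correlation coefficient and R² relate for one explanatory variable. The data values are made up for illustration, and the variable names (sxy, sxx, syy) are just labels chosen here for the sums of products and squares.

```python
from math import sqrt

# Hypothetical data (assumed for illustration): x = explanatory, y = response
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products and squares of the deviations from the means
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

slope = sxy / sxx                    # change in y per one-unit change in x
intercept = mean_y - slope * mean_x  # regression line passes through the means
r = sxy / sqrt(sxx * syy)            # correlation coefficient, between -1 and +1
r_squared = r ** 2                   # proportion of variance in y explained by x

print(slope, intercept, r, r_squared)
```

With a single explanatory variable, R² is simply the square of the correlation coefficient, which is why it always falls between 0 and 1.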

24

Rules of probability

Addition rule = P(A or B) = P(A) + P(B) - P(A and B); Conditional rule = P(A|B) = P(A and B)/P(B)

25

Standard deviation

A measure of dispersion/spread of numerical data that quantifies the average amount of deviation of each data point from the mean. Higher SD = greater variability

26

Type I error (α)

Null hypothesis is rejected even though it is true (incorrect rejection of a true null hypothesis). α is the probability of falsely concluding there is a relationship between variables when there is none.

27

Type II error (β)

Null hypothesis isn't rejected even though it is false (failure to reject a false null hypothesis). β is the probability of failing to detect a relationship between variables when there is one.

28

Z score

Quantifies distance between a data point and the mean of a data set to show you how many standard deviations the value is from the mean. Allows for standardisation.

29

Shape

Symmetrical (normal distribution), positive (right) skewed, negative (left) skewed, or uniform. Helps identify appropriate measure of centre/spread: symmetrical = mean and SD, skewed = median and IQR (+ mean and SD). Distributions can also be unimodal = one peak/mode or bi/multi modal.

30

Spread

Measured through range, variance, SD, and IQR.

31

Variance

The average squared deviation of the values from the mean, which measures how spread out the numbers in a data set are. The SD is the square root of the variance.
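A short sketch of the mean, variance, SD and z score using Python's standard statistics module. The data set is made up for illustration. Note the two versions of variance: dividing by N (population) and dividing by n - 1 (sample).

```python
import statistics

# Hypothetical data set, made up for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)                 # sum of values / number of values
pop_variance = statistics.pvariance(data)    # divides by N (population variance)
pop_sd = statistics.pstdev(data)             # square root of population variance
sample_variance = statistics.variance(data)  # divides by n - 1 (sample variance)
sample_sd = statistics.stdev(data)           # square root of sample variance

# z score of the value 9, treating the data as the whole population
z = (9 - mean) / pop_sd

print(mean, pop_variance, pop_sd, sample_variance, sample_sd, z)
```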

32

Observational Study

Researchers observe participants with no manipulation of variables to assess the relationship/behaviour in natural setting

33

Experimental study

Uses random allocation of participants to establish cause-and-effect relationships AND manipulates/controls variables and measures outcome

34

Longitudinal study (observational)

Follows a group of participants over an extended period to examine changes/trends and help establish temporal precedence. Involves collecting data at multiple time points from the same participants

35

Cross sectional study (observational)

Data is collected from participants at a single point in time, which provides a snapshot of the population's characteristics.

36

Cohort study (observational)

Group of people with common characteristic are followed over time to find how many reach a certain health outcome of interest. Examines the relationship between exposure to certain factors and the development of outcomes/disease

37

Graphs for continuous (y) and categorical (x) variables

Side-by-side box plots (centre/spread) or vertically aligned histograms (shape)

38

Graph for two continuous variables

Scatterplots

39

Describing scatterplot relationships

Consider strength (strong or weak), linear or non linear, and positive or negative. Non-linear relationships examples: exponential patterns or v shapes (can comment on strength but not direction)

40

Outliers

Observations that deviate from distribution pattern caused by natural variation or measurement error. You should always try and explain outliers to discount error.

41

1.5IQR rule

Suspected outliers are values at least 1.5 x IQR above Q3 or below Q1. Values below Q1 - 1.5IQR are low outliers (lower threshold) and values above Q3 + 1.5IQR are high outliers (upper threshold).
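A minimal sketch of the 1.5IQR rule. The data are made up for illustration. Quartiles can be computed in several slightly different ways, so the thresholds here (from Python's statistics.quantiles with its default method) may differ a little from those given by other software.

```python
import statistics

# Hypothetical data with one suspiciously large value
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 40]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

# Values outside the thresholds are suspected outliers
outliers = [x for x in data if x < lower_threshold or x > upper_threshold]

print(q1, q3, iqr, lower_threshold, upper_threshold, outliers)
```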

42

Independent (explanatory) variable

Manipulated/controlled and causes changes in the DV. It’s plotted on the x axis (horizontal)

43

Dependent (response) variable

Measured and records the outcome. It is dependent on the IV and plotted on the y axis (vertical)

44

How does the shape of distribution change the relationship between mean and median and why?

Symmetrical: mean = median, skewed left: mean < median, and skewed right: mean > median. This is because the mean is affected by outliers i.e. when skewed left there are more low value outliers that decrease the mean but when skewed right there are more high value outliers that increase the mean.

45

Graph for two continuous and one categorical variable

Scatterplot that has a key for the categorical data e.g. different colours or symbols for the different categorical levels.

46

Table to describe one categorical variable.

Includes raw (counts) and relative (proportions) frequencies and is the table version of a bar chart.

47

Table to describe two categorical variables

Contingency table/cross tabulation that combines two frequency tables to summarise the relationship between the two variables.

48

Bias

Related to the location of a statistic's sampling distribution compared to the location of the true parameter value. If the difference is 0 the statistic is unbiased. To reduce bias you use random sampling.

49

Precision

Related to the spread of sampling distribution i.e. less spread = more precise. You can improve precision by increasing the sample size.

50

Sampling error

The difference between statistic and parameter that is unavoidable but can be reduced in larger samples

51

Non-sampling error

Any error not caused by random sampling variation, e.g. selection bias and measurement bias

52

Simpson's paradox

When the relationship seen in combined data reverses once the data is split into groups, e.g. a linear relationship that is positive overall but negative within each group (or vice versa)

53

3 R’s of study design

Randomisation, replication and reducing variation (blocking)

54

Simple random sampling (probability)

Researchers randomly select members of the population with each member having an equal probability of being selected.

55

Stratified sampling (probability)

Divide the population into subgroups and randomly sample from each subgroup. This can reduce bias and increase precision.

56

Cluster sampling (probability)

Split population into groups then randomly select groups and test the entire group e.g. schools. It has the potential of bias and there is limited choice for subgroup representation.

57

Sequential/systematic sampling (non-probability)

Systematic selection of a sample. Uses a sampling interval determined by population size ÷ desired sample size, e.g. select every 10th

58

Convenience sampling (non-probability)

Sample readily available participants however it can cause highly biased data.

59

Snowball sampling (non-probability)

Sample by using one participant to find others e.g. “do you know anyone else who could participate in the study”

60

Line-intercept sampling (non-probability)

Line is chosen and any elements in that line form the sample e.g. flight patterns

61

Sensitivity

The probability that a test or measure gives a positive result when the condition/disease is present (true positive) = P(Positive test|Disease present)

62

Specificity

The probability that a test or measure gives a negative result when the condition/disease is absent (true negative) = P(Negative test|No disease)

63

Probability notation

P(x) = probability of event x occurring, which is always between 0 and 1

P(xᶜ) = probability of the complementary event occurring, i.e. x not occurring, so P(xᶜ) = 1 - P(x)

64

Mutually exclusive events

When two (or more) events can’t occur at the same time e.g. roll a 2 and 3 on one die roll.

65

Collectively exhaustive events

Set of events that encompasses all possible outcomes e.g. 1, 2, 3, 4, 5, and 6 for a die roll.

66

Marginal probability

The probability of a single event occurring = P(A)

67

Joint probability

Probability of the intersection of two events = P(A ∩ B)

68

Union probability

The probability of A or B or both occurring = P(A ∪ B)

69

Conditional probability

The probability that A occurs given that B has already occurred = P(A|B)

70

Probability rules

Union rule (mutually exclusive events only): P(A ∪ B) = P(A) + P(B)

Addition rule: P(A or B) = P(A) + P(B) - P(A and B)

Conditional rule: P(A|B) = P(A ∩ B) / P(B)

Joint (rearrangement of conditional rule): P(A ∩ B) = P(A|B) x P(B)
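A small worked example of these rules for one roll of a fair die. The events A (even number) and B (greater than 3) are chosen here purely for illustration. Fractions keep the arithmetic exact.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # event A: roll is even
B = {4, 5, 6}                       # event B: roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


p_a_and_b = prob(A & B)                        # joint: P(A and B)
p_a_or_b = prob(A) + prob(B) - p_a_and_b       # addition rule
p_a_given_b = p_a_and_b / prob(B)              # conditional rule
p_joint_again = p_a_given_b * prob(B)          # joint rule recovers P(A and B)

print(p_a_and_b, p_a_or_b, p_a_given_b)
```

A and B overlap (4 and 6 are in both), so simply adding P(A) + P(B) would give 1, which overstates the true value of 2/3.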

71

Independent events

Whether event A happens or not has no effect on P(B). Determined by any of these equations (you only need to test one):

P(A ∩ B) = P(A) x P(B)

P(A|B) = P(A)

P(B|A) = P(B)
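A sketch of checking independence from the four cells of a 2 x 2 contingency table by testing whether the joint probability equals the product of the marginals. The function name and the counts are hypothetical, chosen for illustration. With real sample data the equality will rarely be exact, so this is a check of the definition rather than a statistical test.

```python
from fractions import Fraction


def is_independent(both, a_only, b_only, neither):
    """Check P(A and B) == P(A) x P(B) from the four cell counts of a 2x2 table."""
    total = both + a_only + b_only + neither
    p_a = Fraction(both + a_only, total)       # marginal probability of A
    p_b = Fraction(both + b_only, total)       # marginal probability of B
    p_a_and_b = Fraction(both, total)          # joint probability of A and B
    return p_a_and_b == p_a * p_b


# Hypothetical counts: P(A) = 0.5, P(B) = 0.4, P(A and B) = 0.2 = 0.5 x 0.4
print(is_independent(both=20, a_only=30, b_only=20, neither=30))

# Hypothetical counts: P(A) = 0.5, P(B) = 0.4, but P(A and B) = 0.3
print(is_independent(both=30, a_only=20, b_only=10, neither=40))
```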

72

Contingency tables

Useful for joint and marginal probabilities

73

Venn diagrams

More useful for graphical representations than calculations.

74

Tree diagrams

Useful for marginal and conditional probabilities

75

Partition

Set of mutually exclusive and collectively exhaustive events (e.g. states and the country)

76

Law of total probability

The total probability of a marginal event, found by summing joint or conditional probabilities over a partition B₁, B₂, ...

Joint: P(A) = P(A ∩ B₁) + P(A ∩ B₂) + ...

Conditional: P(A) = P(A|B₁) x P(B₁) + P(A|B₂) x P(B₂) + ...

77

Bayes rule

Used to invert conditional probabilities.

P(A|B) = P(B|A) x P(A) / P(B)
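A sketch of Bayes rule applied to a diagnostic test: inverting the sensitivity P(Positive|Disease) to get P(Disease|Positive). The sensitivity, specificity and prevalence values are assumed purely for illustration. The denominator P(Positive) comes from the law of total probability over the partition disease / no disease.

```python
# Assumed values, for illustration only
sensitivity = 0.90   # P(Positive | Disease)
specificity = 0.95   # P(Negative | No disease)
prevalence = 0.01    # P(Disease)

# Law of total probability: P(Positive) summed over disease / no disease
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))

# Bayes rule: P(Disease | Positive) = P(Positive | Disease) x P(Disease) / P(Positive)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(round(p_positive, 4), round(p_disease_given_positive, 4))
```

Even with a fairly accurate test, a positive result here corresponds to only about a 15% chance of disease, because the condition is rare and false positives outnumber true positives.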

78

Probability distribution

Describes probabilities of experimental outcomes using data or models. Models provide a formula to define distribution of probabilities for numerical random variables.

79

Discrete uniform probability distribution

All probabilities in sample space are evenly distributed across outcomes.

Function: P(X = x) = 1/n for the n outcomes x = 1, 2, ..., n

Mean: (n+1)/2

Variance: (n+1) x (n-1)/12
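A quick check of the mean and variance formulas against the definitions, using a fair six-sided die (n = 6, outcomes 1 to 6) as an assumed example. Fractions keep the results exact.

```python
from fractions import Fraction

n = 6                          # a fair die: outcomes 1, 2, ..., 6
outcomes = range(1, n + 1)
p = Fraction(1, n)             # P(X = x) = 1/n for every outcome

# Mean and variance straight from the definitions
mean = sum(x * p for x in outcomes)
variance = sum((x - mean) ** 2 * p for x in outcomes)

# The same quantities from the formulas on the card
formula_mean = Fraction(n + 1, 2)
formula_variance = Fraction((n + 1) * (n - 1), 12)

print(mean, variance, formula_mean, formula_variance)
```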

80

Continuous uniform probability distribution

Function: X ~ D (a,b) a (lower) and b (upper) = range.

Density function: f(x) = 1/(b - a) for a ≤ x ≤ b (and 0 otherwise), where f = height of density function

Higher value of density function means observing values in that region is more likely. The probability of a particular continuous outcome = 0

81

Population mean and standard deviation

Population mean μ = expected value E(X)

Population variance σ² = Var(X), so population standard deviation σ = √Var(X)

82

Random variables

Represented by X, they take numerical values determined by a chance outcome. The probability that the random variable takes the value x = P(X = x)

83

Binomial variable criteria

Variable has fixed no. of trials, 2 possible outcomes (success or failure), constant probability of success, and trials are independent (e.g. random sample)

84

Binomial distribution

A discrete distribution represented by X ~ B (n, p) with n = no. of trials and p = probability of success

85

Binomial probability mass function

P(X = x) = nCx x p^x x (1-p)^(n-x) = probability of observing x successes out of n independent trials

nCx = binomial coefficient that counts the no. of ways to arrange x successes in n trials (often represented as a fraction w/o C)

p^x x (1-p)^(n-x) = probability of each sequence of x successes in n trials
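A minimal sketch of the binomial probability mass function using math.comb for nCx. The function name binomial_pmf and the values n = 10, p = 0.5 are chosen here for illustration. Summing over all x also shows that the probabilities add to 1 and that the mean and variance match np and np(1-p).

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.5   # assumed: 10 fair coin tosses, success = heads

# Probability of exactly 3 successes
p_three = binomial_pmf(3, n, p)

# Whole distribution, then mean and variance from the definitions
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(p_three, sum(pmf), mean, variance)
```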

86

Binomial distribution mean and variance

Mean = E(X) or μ = np

Variance = Var(X) or σ² = np(1-p)

87

Normal distributions

Most common continuous distribution, completely characterised (and calculated) by its mean and standard deviation. It must be symmetric and bell shaped and the measures of centre (mean, mode, median) must be equal.

88

Normal probability density function

Represented by X ~ N (μ, σ²) with μ = mean and σ² = variance or standard deviation squared.

89

Empirical rule

68% of values lie within ± 1 SD of mean

95% of values lie within ± 2 SD of mean

99.7% of values lie within ± 3 SD of mean
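The three percentages can be checked against the standard normal distribution. This sketch uses NormalDist from Python's standard statistics module; the dictionary name within is just a label chosen here.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within ±{k} SD: {prob:.4f}")
```

The exact values are slightly different from the rounded rule (95% corresponds more precisely to ±1.96 SD rather than ±2 SD).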

90

Standard normal distribution (z score)

Z ~ N (0,1) = standard normal distribution has mean of 0 and SD of 1. You can convert any value of a random variable into a z-score using z = (x - μ) / σ.

91

Quantile

The value x that matches a given probability statement, i.e. the x for which P(X ≤ x) equals a given probability. Can be applied to normal distributions in statstar.

92

Point estimation

Single value estimate of a parameter based on sample data e.g. estimating the population mean μ with the sample mean x̄

93

Interval estimate

A range between two values (a confidence interval) within which the true population parameter is estimated to lie, with a certain degree of confidence.

94

Central limit theorem

Can be applied to any numerical variable with finite population mean and SD. It allows us to estimate μ and the standard error from the distribution of the sample mean, because the sample mean will be approximately normal even if the population is skewed (allows for empirical rule). Key results:

  • The mean of the sampling distribution of x̄ equals the population μ (x̄ is an unbiased estimate)

  • Standard error = σ / √n, so as we increase sample size the standard error decreases → more precise μ estimates

  • For sample sizes >30 the sample mean will be approximately normally distributed even if X isn’t
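A simulation sketch of these results. The population (exponential with rate 1, which is right skewed with mean 1 and SD 1), the sample size and the number of repetitions are all assumptions made for illustration. The random seed is fixed only so the run is repeatable.

```python
import random
import statistics
from math import sqrt

random.seed(0)   # fixed seed so the run is repeatable

n = 40           # size of each sample (assumed, > 30)
reps = 2000      # number of repeated samples (assumed)

# Draw many samples from a skewed population and record each sample mean
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.mean(sample_means)   # should be close to mu = 1
se_observed = statistics.stdev(sample_means)    # spread of the sample means
se_theory = 1 / sqrt(n)                         # sigma / sqrt(n) with sigma = 1

print(round(mean_of_means, 3), round(se_observed, 3), round(se_theory, 3))
```

A histogram of sample_means would look roughly bell shaped even though the individual exponential values are strongly skewed.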

95

Hypothesis test steps

  1. Null and alternative hypothesis = H0: no effect/no relationship e.g. μ₁ = μ₂; HA: difference/relationship e.g. μ₁ ≠ μ₂

  2. Test statistic = summarises the difference between means/values

  3. Determine null distribution = distribution of test statistic assuming H0 is true

  4. P-value = probability of getting the test statistic or more extreme

  5. Decision = p < significance level (e.g. 0.05) → reject H0; otherwise fail to reject H0

  6. Conclusions = reject or fail to reject null hypothesis due to evidence (p-value)

*check assumptions: differ depending on test

96

Confidence intervals

Formula = sample mean ± multiplier x standard error (the ± part is the margin of error). Multiplier changes based on confidence. Can be used as an alternative to hypothesis testing e.g. checking whether or not the hypothesised μ is in the confidence interval.

97

Z-test

Uses z scores and the empirical rule to make inferences on p-values and confidence intervals. Follows the same hypothesis test steps except the test statistic is a z-score.
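A sketch of a two-tailed, one-sample z-test following the hypothesis test steps. All the numbers (hypothesised mean, known population SD, sample size, sample mean) are made up for illustration, and the example assumes σ is known and the sample mean is approximately normal.

```python
from math import sqrt
from statistics import NormalDist

# Assumed values, for illustration only
mu_0 = 100     # hypothesised population mean under H0
sigma = 15     # population SD (assumed known)
n = 36         # sample size
x_bar = 105    # observed sample mean
alpha = 0.05   # significance level

# Test statistic: how many standard errors x_bar is from mu_0
standard_error = sigma / sqrt(n)
z = (x_bar - mu_0) / standard_error

# Null distribution is the standard normal; two-tailed p-value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Decision
reject_h0 = p_value < alpha

print(z, round(p_value, 4), reject_h0)
```

Here z = 2, so the sample mean sits 2 standard errors above the hypothesised mean, which by the empirical rule leaves roughly 5% in the two tails combined.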

98

Two-tailed vs one-tailed hypothesis test

One tailed = checks difference in one direction e.g. greater than or less than

Two tailed = checks difference in both directions e.g. greater than and less than

99

Margin of error

z score x σ / √n

As you increase sample size the ME gets smaller. The width of the CI is 2 x ME, so as you increase confidence both the ME and the CI width increase.
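A sketch of the margin of error and the resulting confidence interval, showing the effect of confidence level and sample size. The function name margin_of_error and all the numbers are assumptions for illustration, and σ is assumed known.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma, n, confidence):
    """ME = z multiplier x sigma / sqrt(n) for a given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # e.g. about 1.96 for 95%
    return z * sigma / sqrt(n)


# Assumed values, for illustration only
x_bar, sigma, n = 105, 15, 36

me_95 = margin_of_error(sigma, n, 0.95)
ci_95 = (x_bar - me_95, x_bar + me_95)        # width of the CI is 2 x ME

me_99 = margin_of_error(sigma, n, 0.99)       # more confidence -> larger ME
me_95_big_n = margin_of_error(sigma, 4 * n, 0.95)   # 4x sample size -> half the ME

print(round(me_95, 2), ci_95, round(me_99, 2), round(me_95_big_n, 2))
```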

100

Q-Q normal quantile plot

A method of assessing normality by plotting observed data and theoretical quantiles from a normal distribution. If data is normally distributed then the points should approximate a straight line. If there is a clear curve or S shape then that indicates it may not be normally distributed.
