Statistics Flashcards

62 Terms

1

summary statistic

a single number summarizing a large amount of data

e.g., the primary results of the study after 1 year could be described by two summary statistics: the proportions of people who had a stroke in the treatment and control groups.

Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%. Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.
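
A minimal sketch of the arithmetic behind these two summary statistics, using only the counts quoted on this card. The variable names are illustrative.

```python
# Counts taken from the card above: number of strokes and group size.
stroke_treatment, n_treatment = 45, 224
stroke_control, n_control = 28, 227

# A proportion is the count of interest divided by the group size.
p_treatment = stroke_treatment / n_treatment
p_control = stroke_control / n_control

print(f"treatment: {p_treatment:.2f}")  # treatment: 0.20
print(f"control:   {p_control:.2f}")    # control:   0.12
```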

2

case/observational unit

formal name for a row

3

variables

the columns that represent characteristics

4

data matrix

a convenient and common way to organize data, especially if collecting data in a spreadsheet

5

numerical variable

a variable that can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values

e.g., unemployment rate

6

discrete variable

a type of numerical variable that can only take values with jumps between them, typically whole non-negative numbers (0, 1, 2, ...)

e.g., population count or number of children.

7

continuous variable

a type of numerical variable that can take any value within a given range, including fractions and decimals.

e.g., unemployment rate

8

categorical variable

A variable that represents categories or groups and is not numerical in nature.

e.g., states

9

levels

the possible values of a variable

e.g., if the variable is states, the levels are AL, AK, WY, etc.

10

ordinal variable

a categorical variable whose levels have a natural ordering

e.g., educational level

11

nominal variable

a categorical variable whose levels have no natural ordering

12

variables & their specializations

13

Explanatory (Independent) & Response (Dependent) Variables

14

observational study

A type of data collection in which researchers collect data in a way that does not directly interfere with how the data arise

15

cohort

a group of many similar individuals

16

experiment

A type of data collection used when researchers want to investigate the possibility of a causal connection

17

randomized experiment

An experiment in which individuals are randomly assigned to groups

18

placebo

fake treatment

19

ASSOCIATION DOESN’T EQUAL CAUSATION

In general, association does not imply causation, and causation can only be inferred from a randomized experiment.

20

contingency table

A table that summarizes data for two categorical variables

21

row totals

provide the total counts across each row

22

column totals

total counts down each column

23

bar plot

knowt flashcard image
24

row proportions

counts divided by their row totals

25

column proportion

count divided by the corresponding column total
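
A small sketch of how row totals, column totals, row proportions, and column proportions relate in a contingency table. The table reuses the stroke counts from card 1; the "no event" counts are derived here as group size minus strokes, and the labels are illustrative assumptions.

```python
# Contingency table: rows are groups, columns are outcomes.
# "no event" counts are derived: 224 - 45 = 179 and 227 - 28 = 199.
table = {
    "treatment": {"stroke": 45, "no event": 179},
    "control": {"stroke": 28, "no event": 199},
}
columns = ["stroke", "no event"]

# Row totals sum across each row; column totals sum down each column.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in columns}

# Row proportions: each count divided by its row total.
row_props = {
    row: {col: table[row][col] / row_totals[row] for col in columns}
    for row in table
}
# Column proportions: each count divided by its column total.
col_props = {
    row: {col: table[row][col] / col_totals[col] for col in columns}
    for row in table
}

print(row_totals)  # {'treatment': 224, 'control': 227}
print(col_totals)  # {'stroke': 73, 'no event': 378}
print(f"{row_props['treatment']['stroke']:.2f}")  # 0.20
print(f"{col_props['treatment']['stroke']:.2f}")  # 0.62
```

Row proportions sum to 1 across each row, while column proportions sum to 1 down each column, so the two answer different questions about the same counts.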

26

stacked bar plot vs. side-by-side bar plot vs. standardized stacked bar plot

a graphical display of contingency table information

27

mosaic plot

a visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well.

28

pie chart

Useful for giving a high-level overview to show how a set of cases break down. However, it is also difficult to decipher details in a pie chart.

29

side-by-side box plot vs. hollow histogram

a traditional tool for comparing across groups vs. used to compare numerical data across groups

30

Cross-sectional

when all the data are collected at one point in time

31

time series

when the data are collected over a period of time (e.g., from 1946 to 2020)

32

categorical values & histogram

DO NOT MAKE A HISTOGRAM/COMPUTE MEAN OR SD FOR A CATEGORICAL VARIABLE

33

median & IQR (measure of center & variation)

used when the data have significant skew, since both are less affected by extreme observations

34

mean & standard deviation (measure of center & variation)

used when the data are roughly symmetric; both are more affected by extreme observations than the median and IQR

35

left-skewed vs right-skewed on boxplots

left-skewed = median closer to Q3

right-skewed = median closer to Q1

36

Descriptive statistics are useful _

in that they are easy to calculate, summarize information efficiently, and allow for straightforward comparisons between groups.

37

Absolute Figure

Absolute figures can usually be interpreted without any context or additional information - a score, number, or figure has some intrinsic meaning

e.g., when I tell you that I shot 83, you don’t need to know what other golfers shot that day in order to evaluate my performance.

38

Relative Figure

A value or figure has meaning only in comparison to something else, or in some broader context, such as compared with the eight golfers who shot better than I did.

e.g., if 43 correct answers fall into the 83rd percentile, then this student is doing better than most of his peers statewide. If he’s in the 8th percentile, then he’s really struggling. In this case, the percentile (the relative score) is more meaningful than the number of correct answers (the absolute score).

39

standard deviation

  • a measure of how dispersed the data are from their mean

  • roughly describes how far away the typical observation is from the mean

  • square root of the variance.
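
A short sketch of how the standard deviation follows from the variance, using Python's standard `statistics` module on a made-up data set. It shows the population versions, which divide by n and match "the average squared distance from the mean". Many textbooks use the sample versions instead, which divide by n − 1 and are shown for comparison.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical data, chosen so the arithmetic comes out evenly.
data = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(data)           # 5
pop_var = pvariance(data)     # mean of squared deviations: 32 / 8 = 4
pop_sd = pstdev(data)         # square root of the population variance: 2.0

sample_var = variance(data)   # divides by n - 1 instead: 32 / 7
sample_sd = stdev(data)       # square root of the sample variance

print(f"mean = {center}, variance = {pop_var}, sd = {pop_sd:.1f}")
print(f"sample variance = {sample_var:.3f}, sample sd = {sample_sd:.3f}")
```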

40

index

a descriptive statistic made up of other descriptive statistics

41

Histograms

a more heavily binned version of the stacked dot plot

  • provide a view of the data density

  • convenient for understanding the shape of the data distribution

42

right skewed vs. left skewed vs symmetric (ONLY DESCRIBED IN HISTOGRAMS/BOXPLOTS, THINGS WITH NUMERIC DATA NOT CATEGORICAL)

longer right tail (typically mean > median) vs. longer left tail (typically mean < median) vs. roughly equal trailing off on both sides (mean ≈ median)
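
A quick illustration of the mean-versus-median comparison on three made-up data sets. The relationship is a rule of thumb rather than a guarantee, so treat this as a sketch of the typical pattern.

```python
from statistics import mean, median

# Hypothetical data sets: one long right tail, its mirror image, and a symmetric one.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
left_skewed = [-x for x in right_skewed]
symmetric = [1, 2, 3, 4, 5]

for name, values in [
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
    ("symmetric", symmetric),
]:
    # The extreme value pulls the mean toward the long tail; the median barely moves.
    print(f"{name}: mean = {mean(values)}, median = {median(values)}")
```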

43

mode

represented by a prominent peak in the distribution

44

unimodal vs bimodal vs multimodal

unimodal - one prominent peak (a second, less prominent peak is not counted if it differs from its neighboring bins by only a few observations)

bimodal - two prominent peaks

multimodal - any distribution with more than two prominent peaks

45

variance

the average squared distance from the mean

46

box plot

summarizes a data set using five statistics while also plotting unusual observations

47

median

  • splits the data in half

  • If the data are ordered from smallest to largest, the _ is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.
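
A minimal sketch of the odd and even cases described above, using `statistics.median` from the Python standard library on made-up values.

```python
from statistics import median

odd_count = [7, 1, 4]        # sorted: 1, 4, 7 -> the middle value is 4
even_count = [7, 1, 4, 10]   # sorted: 1, 4, 7, 10 -> the middle values are 4 and 7

print(median(odd_count))     # 4
print(median(even_count))    # 5.5, the average of 4 and 7
```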

48

interquartile range (IQR)

  • It, like the standard deviation, is a measure of variability in data. The more variable the data, the larger the standard deviation and IQR tend to be.

  • The _ is the length of the box in a box plot. It is computed as _ = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles.

49

first quartile (Q1)

the 25th percentile, i.e. 25% of the data fall below this value

50

third quartile (Q3)

the 75th percentile

51

finding outliers w/ IQR

High outlier = any value above Q3 + 1.5 × IQR

Low outlier = any value below Q1 − 1.5 × IQR
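
A sketch of the 1.5 × IQR rule on a made-up data set, using `statistics.quantiles` from the Python standard library. Quartile conventions differ between textbooks and software; the `method="inclusive"` choice here is an assumption, and other conventions can give slightly different Q1 and Q3.

```python
from statistics import quantiles

# Hypothetical data, chosen so that one value sits far above the rest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# quantiles(..., n=4) returns the three cut points: Q1, the median, and Q3.
q1, _median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers)                  # [30]
```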

52

Range

Highest Value - Lowest Value

53

leverage

data points with extreme X values

54

influential points

outlier(s) that change the model

55

predictor variable

the independent or X variable in a linear relationship

56

correlation

a measure of the strength and direction of the linear relationship between two numeric variables

57

A distinct pattern of some sort in a residual plot indicates that a linear model is NOT a good fit for the data. 

58

If the X and Y axes were reversed on a scatterplot

any positive relationships would still appear as positive relationships.
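
One way to see why swapping the axes keeps a positive relationship positive: the Pearson correlation coefficient treats the two variables symmetrically. The sketch below computes it by hand on made-up data. The function name `pearson_r` and the data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: the sum of products of deviations, scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)


# Hypothetical paired observations with an upward trend.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # axes reversed

print(f"{r_xy:.3f}")  # 0.775
print(f"{r_yx:.3f}")  # 0.775
```

The sign and size of the correlation are unchanged when x and y trade places. The slope of a fitted regression line, by contrast, generally does change.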

59

Correlation (the degree to which two phenomena are related to one another) does not imply causation; a positive or negative association between two variables does not necessarily mean that a change in one of the variables is causing the change in the other

For example, I alluded earlier to a likely positive correlation between a student’s SAT scores and the number of televisions that his family owns. This does not mean that overeager parents can boost their children’s test scores by buying an extra five televisions for the house. Nor does it likely mean that watching lots of television is good for academic achievement.

60

coefficient of variation

  • a statistical measure that describes the relative variability of a dataset by taking the ratio of the standard deviation to the mean (CV = Standard Deviation / Mean)

  • It is a unitless value, often expressed as a percentage, that allows for the comparison of variability across different datasets or groups, especially those with different means or measurement units.  
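
A brief sketch of the CV formula above on two made-up data sets that differ only in scale. It uses the sample standard deviation from Python's `statistics` module, which is an assumption; the card does not say which version is intended.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical measurements; the second set is the first multiplied by 10.
small_scale = [10, 12, 14]
large_scale = [100, 120, 140]

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)

print(f"{cv_small:.1%}")  # 16.7%
print(f"{cv_large:.1%}")  # 16.7%
```

The standard deviations differ by a factor of 10, but the CV is the same, which is what makes it usable for comparing variability across different scales or units.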

61

Empirical Rule

  • for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.

  • This rule applies to symmetric, bell-shaped data and is used to estimate the percentage of values in specific intervals around the mean and to identify potential outliers. 
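
The three percentages can be checked against the normal distribution itself. This sketch uses `statistics.NormalDist` from the Python standard library (available since Python 3.8) with a standard normal, where one standard deviation equals one unit.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```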

62