Statistics Flashcards

62 Terms

New cards

summary statistic

a single number summarizing a large amount of data

e.g., the primary results of the study after 1 year could be described by two summary statistics: the proportions of people who had a stroke in the treatment and control groups.

Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%. Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.
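The arithmetic on this card can be checked with a short sketch. The counts come from the card itself; the dictionary layout and the function name `proportion` are illustrative choices, not part of the original study.

```python
# Counts taken from the card above: (strokes, group size) after 1 year.
stroke_counts = {"treatment": (45, 224), "control": (28, 227)}


def proportion(events: int, total: int) -> float:
    """Return the share of cases in which the event occurred."""
    return events / total


# One summary statistic per group.
proportions = {
    group: proportion(events, total)
    for group, (events, total) in stroke_counts.items()
}

for group, p in proportions.items():
    # Show the value both as a decimal and as a rounded percentage.
    print(f"{group}: {p:.2f} ({p:.0%})")
```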

New cards

case/observational unit

formal name for a row

New cards

variables

the columns that represent characteristics

New cards

data matrix

a convenient and common way to organize data, especially if collecting data in a spreadsheet

New cards

numerical variable

a variable that can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values

e.g., unemployment rate

New cards

discrete variable

a type of numerical variable that can only take values with jumps between them, typically whole non-negative numbers (0, 1, 2, ...)

e.g., population count or number of children.

New cards

continuous variable

a type of numerical variable that can take any value within a given range, including fractions and decimals.

e.g., unemployment rate

New cards

categorical variable

A variable that represents categories or groups and is not numerical in nature.

e.g., state

New cards

levels

the possible values of a variable

e.g., if the variable is state, the levels are AL, AK, ..., WY

New cards

ordinal variable

a categorical variable whose levels have a natural ordering

e.g., educational level

New cards

nominal variable

a categorical variable whose levels have no natural ordering

New cards

variables & their specializations

New cards

Explanatory (Independent) & Response (Dependent) Variables

New cards

observational study

One type of data collection where researchers collect data in a way that does not directly interfere with how the data arise

New cards

cohort

a group of many similar individuals

New cards

experiment

One type of data collection, used when researchers want to investigate the possibility of a causal connection

New cards

randomized experiment

An experiment in which individuals are randomly assigned to groups

New cards

placebo

fake treatment

New cards

ASSOCIATION DOESN’T EQUAL CAUSATION

In general, association does not imply causation, and causation can only be inferred from a randomized experiment.

New cards

contingency table

A table that summarizes data for two categorical variables

New cards

row totals

provide the total counts across each row

New cards

column totals

total counts down each column

New cards

bar plot

New cards

row proportions

counts divided by their row totals

New cards

column proportion

count divided by the corresponding column total
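Row totals, column totals, row proportions, and column proportions can all be computed from one small contingency table. The sketch below reuses the stroke counts from the summary statistic card; the "no event" counts are derived here by subtraction (group size minus strokes), and the variable names are illustrative.

```python
# Rows: group; columns: outcome. "no event" = group size minus strokes.
table = {
    "treatment": {"stroke": 45, "no event": 179},
    "control": {"stroke": 28, "no event": 199},
}

# Row totals: sum across each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: sum down each column.
columns = list(next(iter(table.values())))
col_totals = {col: sum(table[row][col] for row in table) for col in columns}

# Row proportions: each count divided by its row total.
row_props = {
    row: {col: n / row_totals[row] for col, n in cells.items()}
    for row, cells in table.items()
}

# Column proportions: each count divided by its column total.
col_props = {
    row: {col: n / col_totals[col] for col, n in cells.items()}
    for row, cells in table.items()
}

print(row_totals, col_totals)
print(round(row_props["treatment"]["stroke"], 3))
print(round(col_props["treatment"]["stroke"], 3))
```

Row proportions answer "what share of each group had a stroke?", while column proportions answer "what share of all strokes came from each group?", so the two are generally not interchangeable.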

New cards

stacked bar plot vs. side-by-side bar plot vs. standardized stacked bar plot

a graphical display of contingency table information

New cards

mosaic plot

a visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well.

New cards

pie chart

Useful for giving a high-level overview to show how a set of cases break down. However, it is also difficult to decipher details in a pie chart.

New cards

side-by-side box plot vs. hollow histogram

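side-by-side box plot = a traditional tool for comparing numerical data across groups

hollow histogram = histogram outlines for each group overlaid on the same axes, also used to compare numerical data across groups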

New cards

Cross-sectional

when all the data are collected at one point in time

New cards

time series

when the data are collected over a period of time (e.g., from 1946 to 2020)

New cards

categorical values & histogram

DO NOT MAKE A HISTOGRAM/COMPUTE MEAN OR SD FOR A CATEGORICAL VARIABLE

New cards

median & IQR (measure of center & variation)

preferred when the data have significant skew or extreme observations, because both are resistant to extreme values

New cards

mean & standard deviation (measure of center & variation)

preferred when the data are roughly symmetric; both are more affected by extreme observations than the median and IQR

New cards

left-skewed vs right-skewed on boxplots

left-skewed = median closer to Q3

right-skewed = median closer to Q1

New cards

Descriptive statistics are useful _

in that they are easy to calculate, summarize information efficiently, and allow for straightforward comparisons between groups.

New cards

Absolute Figure

Absolute figures can usually be interpreted without any context or additional information - a score, number, or figure has some intrinsic meaning

e.g., when I tell you that I shot 83, you don’t need to know what other golfers shot that day in order to evaluate my performance

New cards

Relative Figure

A value or figure has meaning only in comparison to something else, or in some broader context, such as compared with the eight golfers who shot better than I did.

e.g., if 43 correct answers falls into the 83rd percentile, then this student is doing better than most of his peers statewide. If he’s in the 8th percentile, then he’s really struggling. In this case, the percentile (the relative score) is more meaningful than the number of correct answers (the absolute score).

New cards

standard deviation

a measure of how dispersed the data are from their mean
roughly describes how far away the typical observation is from the mean
square root of the variance.

New cards

index

a descriptive statistic made up of other descriptive statistics

New cards

Histograms

a more heavily binned version of the stacked dot plot

provide a view of the data density
convenient for understanding the shape of the data distribution

New cards

right skewed vs. left skewed vs symmetric (ONLY DESCRIBED IN HISTOGRAMS/BOXPLOTS, THINGS WITH NUMERIC DATA NOT CATEGORICAL)

longer right tail (typically mean > median) vs. longer left tail (typically mean < median) vs. roughly equal trailing off on both sides (mean ≈ median)
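The relationship between skew and the mean/median ordering can be illustrated with toy data. The three samples below are hypothetical and chosen only to show each shape; the ordering is a rule of thumb rather than a guarantee for every data set.

```python
from statistics import mean, median

# Hypothetical toy samples, chosen only to illustrate each shape.
samples = {
    "right-skewed": [1, 2, 2, 3, 3, 3, 4, 20],
    "left-skewed": [1, 17, 18, 18, 18, 19, 19, 20],
    "symmetric": [1, 2, 3, 4, 5],
}


def compare_center(values):
    """Return (mean, median) so the two measures of center can be compared."""
    return mean(values), median(values)


for shape, values in samples.items():
    m, med = compare_center(values)
    print(f"{shape}: mean={m}, median={med}")
```

The single large value in the right-skewed sample pulls the mean up while the median stays put, which is why the median is the more resistant measure of center.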

New cards

mode

represented by a prominent peak in the distribution

New cards

unimodal vs bimodal vs multimodal

unimodal - one prominent peak (a second, less prominent peak is not counted if it differs from its neighboring bins by only a few observations)

bimodal - two prominent peaks

multimodal - any distribution with more than 2 prominent peaks

New cards

variance

the average squared distance from the mean
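A short sketch of variance and standard deviation using Python's `statistics` module. The data are hypothetical. Note one assumption to be aware of: "average" on this card reads most literally as dividing by n (population variance), while many introductory texts divide by n − 1 for a sample; both versions are shown.

```python
from math import sqrt
from statistics import mean, pvariance, variance

# Hypothetical data set with a mean of exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(data)

# Squared distance of every observation from the mean.
squared_distances = [(x - center) ** 2 for x in data]

# Dividing by n gives the population variance.
var_by_hand = sum(squared_distances) / len(data)

# Library equivalents: pvariance divides by n, variance divides by n - 1.
population_var = pvariance(data)
sample_var = variance(data)

# The standard deviation is the square root of the variance.
population_sd = sqrt(population_var)

print(var_by_hand, population_var, sample_var, population_sd)
```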

New cards

box plot

summarizes a data set using five statistics while also plotting unusual observations

New cards

median

splits the data in half
If the data are ordered from smallest to largest, the _ is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.

New cards

interquartile range (IQR)

It, like the standard deviation, is a measure of variability in data. The more variable the data, the larger the standard deviation and IQR tend to be.
The _ is the length of the box in a box plot. It is computed as _ = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles.

New cards

first quartile (Q1)

the 25th percentile, i.e. 25% of the data fall below this value

New cards

third quartile (Q3)

the 75th percentile

New cards

finding outliers w/ IQR

High outlier = any value above Q3 + 1.5 × IQR

Low outlier = any value below Q1 − 1.5 × IQR
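The 1.5 × IQR rule can be sketched as follows. The data are hypothetical, and the quartiles use the `"inclusive"` method of `statistics.quantiles`, which is one of several quartile conventions; a textbook or calculator may give slightly different Q1 and Q3 for the same data.

```python
from statistics import quantiles

# Hypothetical data with one suspiciously large value.
data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond these is flagged as an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < low_fence or x > high_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({low_fence}, {high_fence}), outliers={outliers}")
```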

New cards

Range

Highest Value - Lowest Value

New cards

leverage

data points with extreme X values

New cards

influential points

outlier(s) that change the model

New cards

predictor variable

the independent or X variable in a linear relationship

New cards

correlation

a measure of the strength and direction of the linear relationship between two numerical variables; it always falls between −1 and 1

New cards

A distinct pattern of some sort in a residual plot indicates that a linear model is NOT a good fit for the data.

New cards

If the X and Y axes were reversed on a scatterplot

any positive relationships would still appear as positive relationships.
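One way to see why reversing the axes keeps a positive relationship positive: the correlation formula treats the two variables symmetrically, so swapping them gives the same value. The sketch below computes the Pearson correlation by hand; the data and the function name `correlation` are hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: hours studied and test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

r_xy = correlation(hours, scores)  # hours on the X axis
r_yx = correlation(scores, hours)  # axes reversed

print(round(r_xy, 3), round(r_yx, 3))
```

Swapping the axes does change the slope of a fitted regression line, so the symmetry shown here applies to the correlation, not to the fitted model.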

New cards

Correlation (the degree to which two phenomena are related to one another) does not imply causation; a positive or negative association between two variables does not necessarily mean that a change in one of the variables is causing the change in the other

For example, I alluded earlier to a likely positive correlation between a student’s SAT scores and the number of televisions that his family owns. This does not mean that overeager parents can boost their children’s test scores by buying an extra five televisions for the house. Nor does it likely mean that watching lots of television is good for academic achievement.

New cards

coefficient of variation

a statistical measure that describes the relative variability of a dataset by taking the ratio of the standard deviation to the mean (CV = Standard Deviation / Mean)
It is a unitless value, often expressed as a percentage, that allows for the comparison of variability across different datasets or groups, especially those with different means or measurement units.
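A minimal sketch of the CV formula, comparing two hypothetical data sets measured in different units. The sample standard deviation (`statistics.stdev`) is used here as an assumption; the card does not say whether the sample or population version is intended.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return standard deviation divided by mean (unitless)."""
    return stdev(values) / mean(values)


# Hypothetical measurements in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

# Expressed as percentages, the two can be compared directly.
print(f"height CV: {cv_height:.1%}, weight CV: {cv_weight:.1%}")
```

Because the units cancel in the ratio, the weights can be said to vary more, relative to their mean, than the heights, even though centimeters and kilograms cannot be compared directly.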

New cards

Empirical Rule

for a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
This rule applies to symmetric, bell-shaped data and is used to estimate the percentage of values in specific intervals around the mean and to identify potential outliers.
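The 68 / 95 / 99.7 figures can be checked against an exact normal distribution with `statistics.NormalDist` (Python 3.8+). The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

Under the rule, a value more than three standard deviations from the mean is expected only about 0.3% of the time, which is why such values are often treated as potential outliers.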

New cards