Continuous
Data that can take on any value in an interval.
Discrete
Data that can take on only integer values, such as counts.
Categorical
Data that can take on only a specific set of values representing a set of possible categories.
Binary
A special case of categorical data with just two categories of values (0/1, true/false).
Ordinal
Categorical data that has an explicit ordering.
Data frame
Rectangular data (like a spreadsheet).
Feature
A column in the table.
Outcome
A variable being predicted.
Record
A row in the table.
Mean
The sum of all values divided by the number of values.
Weighted mean
The sum of each value multiplied by its weight, divided by the sum of the weights.
Median
The value such that one-half of the data lies above it and one-half lies below it.
Weighted median
The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it.
Trimmed mean
The average of all values after dropping a fixed number of extreme values.
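The location estimates defined so far can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper names (weighted_mean, weighted_median, trimmed_mean), the sample data, and the choice to trim k values from each end are assumptions made for the example.

```python
import statistics

def weighted_mean(values, weights):
    # Sum of value * weight, divided by the sum of the weights.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_median(values, weights):
    # Sort by value, then walk until the cumulative weight reaches half the total.
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value

def trimmed_mean(values, k):
    # Drop the k smallest and k largest values, then average the rest.
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [2, 3, 3, 4, 5, 100]        # 100 is an extreme value
weights = [1, 1, 1, 1, 1, 0]       # hypothetical weights that ignore the 100

mean = statistics.mean(data)       # 19.5, pulled up by the extreme value
median = statistics.median(data)   # 3.5
trimmed = trimmed_mean(data, 1)    # 3.75
w_mean = weighted_mean(data, weights)      # 3.4
w_median = weighted_median(data, weights)  # 3
```

The mean moves a long way because of one value, while the median and trimmed mean barely notice it.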
Robust
Not sensitive to extreme values.
Outlier
A data value that is very different from most of the data.
Deviations
The differences between the observed values and the estimate of location.
Variance
The sum of squared deviations from the mean divided by n-1, where n is the number of data values.
Standard deviation
The square root of the variance.
Mean absolute deviation
The mean of the absolute values of the deviations from the mean.
Median absolute deviation from the median
The median of the absolute values of the deviations from the median.
Range
The difference between the largest and the smallest value in a data set.
Percentile
The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more.
Interquartile range
The difference between the 75th percentile and the 25th percentile.
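The variability estimates above can be computed side by side, as in this rough sketch. The data is made up, and the quartiles come from statistics.quantiles with its default "exclusive" method; other percentile conventions can give slightly different values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mean = statistics.mean(data)
median = statistics.median(data)

variance = statistics.variance(data)   # divides by n-1; 7.5 here
std_dev = statistics.stdev(data)       # square root of the variance

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

# Median absolute deviation from the median: 2 here.
median_abs_dev = statistics.median(abs(x - median) for x in data)

value_range = max(data) - min(data)    # 8

# Quartiles are the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # 7.5 - 2.5 = 5.0
```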
Boxplot
A plot introduced by Tukey as a quick way to visualize the distribution of data.
Frequency table
A tally of the count of numeric data values that fall into a set of intervals (bins).
Histogram
A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.
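A frequency table with equal-width bins, and a crude text histogram drawn from it, might look like the following. The function name frequency_table and the rule that the maximum value goes into the last bin are choices made for this sketch; it also assumes the data is not all one value (otherwise the bin width would be zero).

```python
def frequency_table(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

edges, counts = frequency_table([1, 2, 2, 3, 5, 8, 9, 10], n_bins=3)
# edges  -> [1.0, 4.0, 7.0, 10.0]
# counts -> [4, 1, 3]

# Text histogram: one row per bin, one '#' per record.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:g} to {right:g}: {'#' * count}")
```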
Density plot
A smoothed version of the histogram, often based on a kernel density estimate.
Mode
The most commonly occurring category or value in a data set.
Expected value
When the categories can be associated with a numeric value, this gives an average value based on a category's probability of occurrence.
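As a small worked example of mode and expected value, suppose (hypothetically) that each customer ends up in one of three subscription categories, each with a monthly value and a probability. All names and numbers below are invented for illustration.

```python
import statistics

# Observed categories for six hypothetical customers.
plans = ["basic", "none", "none", "premium", "none", "basic"]
most_common = statistics.mode(plans)   # "none"

# Category -> (numeric value, probability of occurrence).
outcomes = {
    "premium": (300, 0.05),
    "basic": (50, 0.15),
    "none": (0, 0.80),
}

# Expected value: each value weighted by its probability, then summed.
expected_value = sum(value * prob for value, prob in outcomes.values())
# 300*0.05 + 50*0.15 + 0*0.80 = 22.5
```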
Bar charts
The frequency or proportion for each category plotted as bars.
Pie charts
The frequency or proportion for each category plotted as wedges in a pie.
Correlation coefficient
A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).
Correlation matrix
A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.
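The cards do not say which correlation coefficient is meant; the sketch below assumes Pearson's coefficient, the most common choice, and builds a correlation matrix as a nested dictionary. The function names and data are illustrative, and constant columns (zero variance) are not handled.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations, scaled by both spreads.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    # columns maps a variable name to its list of values.
    names = list(columns)
    return {r: {c: pearson_r(columns[r], columns[c]) for c in names}
            for r in names}

columns = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],    # rises with x: correlation near +1
    "z": [8, 6, 4, 2],    # falls as x rises: correlation near -1
}
matrix = correlation_matrix(columns)
```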
Scatterplot
A plot in which the x-axis is the value of one variable, and the y-axis the value of another.
Contingency tables
A tally of counts between two or more categorical variables.
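A contingency table for two categorical variables can be tallied with a Counter, as in this minimal sketch. The variables (a grade and a payment status) and the records are hypothetical.

```python
from collections import Counter

# Each record is (grade, status).
records = [
    ("A", "paid"), ("A", "late"), ("B", "paid"), ("A", "paid"),
    ("B", "late"), ("B", "paid"), ("C", "paid"),
]

counts = Counter(records)
rows = sorted({grade for grade, _ in records})
cols = sorted({status for _, status in records})

# Nested dict: table[grade][status] is the count for that combination.
# A Counter returns 0 for combinations that never occur.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
```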
Hexagonal binning
A plot of two numeric variables with the records binned into hexagons.
Contour plots
A plot showing the density of two numeric variables like a topographical map.
Violin plots
Similar to a boxplot but showing the density estimate.
Sample
A subset from a larger data set.
Population
The larger data set or idea of a data set.
N (n)
The size of the population (sample).
Random sampling
Drawing elements into a sample at random.
Stratified sampling
Dividing the population into strata and randomly sampling from each stratum.
Simple random sample
The sample that results from random sampling without stratifying the population.
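The difference between a simple random sample and a stratified sample can be sketched as follows. The population (90 "north" and 10 "south" records), the function names, and the equal allocation per stratum are assumptions for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    # Every element has the same chance of selection; no stratifying.
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    # Group the population into strata, then sample within each stratum.
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = ([("north", i) for i in range(90)]
              + [("south", i) for i in range(10)])

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda item: item[0], 5, rng)
south_in_strat = sum(1 for region, _ in strat if region == "south")  # 5
```

The stratified sample always contains five "south" records, whereas the simple random sample may contain few or none.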
Sample bias
A sample that misrepresents the population.
Sample statistic
A metric calculated for a sample of data.
Data distribution
The frequency distribution of individual values in a data set.
Sampling distribution
The frequency distribution of a sample statistic over many samples.
Central limit theorem
The tendency of the sampling distribution to take on a normal shape as sample size rises.
Standard error
The variability of a sample statistic over many samples.
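The sampling distribution of the mean and its standard error can be illustrated by simulation. This sketch assumes a made-up, skewed population drawn from an exponential distribution. The last lines use the formula s / sqrt(n), a standard estimate of the standard error of the mean from a single sample, which the cards themselves do not state.

```python
import math
import random
import statistics

rng = random.Random(42)
# Hypothetical skewed population with mean close to 1.
population = [rng.expovariate(1.0) for _ in range(10_000)]

def sample_means(sample_size, n_samples):
    # Sampling distribution: the same statistic over many samples.
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

means_5 = sample_means(5, 1000)
means_50 = sample_means(50, 1000)

# Standard error: the spread of the statistic across samples.
se_5 = statistics.stdev(means_5)
se_50 = statistics.stdev(means_50)   # smaller, since the sample is larger

# Estimate from a single sample of size 50.
one_sample = rng.sample(population, 50)
se_formula = statistics.stdev(one_sample) / math.sqrt(50)
```

Plotting means_5 and means_50 as histograms is one way to inspect how the shape becomes more bell-like as the sample size rises.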
Bootstrap
A method to estimate the sampling distribution by drawing additional samples with replacement from the sample itself.
Bootstrap sample
A sample taken with replacement from an observed data set.
Resampling
The process of taking repeated samples from observed data.
Confidence level
The percentage of confidence intervals, constructed in the same way from the same population, that are expected to contain the statistic of interest.
Interval endpoints
The top and bottom of the confidence interval.
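A bootstrap estimate of a sampling distribution, with a confidence interval read off its percentiles, might be sketched like this. The data, the 90% confidence level, the number of resamples, and the simple percentile method for the endpoints are all assumptions for the example.

```python
import random
import statistics

def bootstrap_statistic(data, statistic, n_resamples, rng):
    n = len(data)
    # Each bootstrap sample draws n items with replacement from the data.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

rng = random.Random(1)
data = [12, 15, 9, 20, 14, 18, 11, 16, 13, 17]

boot_medians = bootstrap_statistic(data, statistics.median, 2000, rng)

# Bootstrap standard error of the median.
boot_se = statistics.stdev(boot_medians)

# 90% interval: trim 5% from each end of the sorted bootstrap results.
ordered = sorted(boot_medians)
lower = ordered[int(0.05 * len(ordered))]
upper = ordered[int(0.95 * len(ordered)) - 1]
```

Here lower and upper are the interval endpoints, and 90% is the confidence level.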
Error
The difference between a data point and a predicted or average value.
Standardize
Subtract the mean and divide by the standard deviation.
z-score
The result of standardizing an individual data point.
Standard normal
A normal distribution with mean = 0 and standard deviation = 1.
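Standardizing is a one-line calculation once the mean and standard deviation are known. The sketch below uses made-up data and the population standard deviation (statistics.pstdev); using the sample standard deviation instead is an equally common choice.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)      # 5
sd = statistics.pstdev(data)      # 2.0

# z-score: subtract the mean, divide by the standard deviation.
z_scores = [(x - mean) / sd for x in data]
# -> [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After standardizing, the values have mean 0 and standard deviation 1, but the shape of the distribution is unchanged: standardizing does not make data normally distributed.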
QQ-Plot
A plot to visualize how close a sample distribution is to a normal distribution.
Tail
The long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency.
Skew
Where one tail of a distribution is longer than the other.
n
Sample size.
Degrees of freedom
A parameter that allows the t-distribution to adjust to different sample sizes.
Trial
An event with a discrete outcome (e.g., a coin flip).
Success
The outcome of interest for a trial.
Binomial trial
A trial with two outcomes.
Binomial distribution
Distribution of the number of successes in n trials, each with the same probability of success.
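The probability of a given number of successes follows from the standard binomial formula, C(n, k) * p^k * (1 - p)^(n - k). The function name binomial_pmf and the coin-flip numbers are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials,
    # each with success probability p.
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 5 fair coin flips: 10/32.
two_heads = binomial_pmf(2, 5, 0.5)   # 0.3125

# The probabilities over all possible counts sum to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
```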
Lambda
The rate at which events occur.
Poisson distribution
The frequency distribution of the number of events in sampled units of time or space.
Exponential distribution
The frequency distribution of the time or distance from one event to the next.
Weibull distribution
A generalized version of the exponential, in which the event rate is allowed to change over time.
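The link between lambda, the exponential distribution, and the Poisson distribution can be seen in a small simulation: if gaps between events are exponential with rate lambda, the event counts per unit interval are Poisson with mean lambda. The rate of 2.0, the seed, and the Weibull shape of 1.5 are arbitrary choices for this sketch.

```python
import random

rng = random.Random(7)
lam = 2.0  # hypothetical rate: 2 events per unit of time

# Exponential: time from one event to the next.
gaps = [rng.expovariate(lam) for _ in range(20_000)]

# Turn the gaps into event times on a timeline.
event_times = []
t = 0.0
for gap in gaps:
    t += gap
    event_times.append(t)

# Poisson: number of events in each complete unit interval.
n_intervals = int(event_times[-1])
counts = [0] * n_intervals
for time in event_times:
    if time < n_intervals:
        counts[int(time)] += 1

mean_gap = sum(gaps) / len(gaps)        # close to 1 / lam = 0.5
mean_count = sum(counts) / n_intervals  # close to lam = 2.0

# Weibull: random.weibullvariate(scale, shape). With shape > 1 the event
# rate increases over time; with shape == 1 it reduces to the exponential.
weibull_gaps = [rng.weibullvariate(1 / lam, 1.5) for _ in range(20_000)]
```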
Treatment
Something to which a subject is exposed.
Treatment group
A group of subjects exposed to a specific treatment.
Control group
A group of subjects exposed to no (or standard) treatment.
Randomization
The process of randomly assigning subjects to treatments.
Subjects
The items that are exposed to treatments.
Test statistic
The metric used to measure the effect of the treatment.
Null hypothesis
The hypothesis that chance is to blame.
Alternative hypothesis
Counterpoint to the null (what you hope to prove).
One-way test
Hypothesis test that counts chance results only in one direction.
Two-way test
Hypothesis test that counts chance results in two directions.
Permutation test
The procedure of combining two or more samples together, and randomly reallocating the observations to resamples.
With or without replacement
In sampling, whether or not an item is returned to the sample before the next draw.
P-value
Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results.
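A permutation test and its p-value can be sketched as below. The choice of test statistic (difference in group means), the counting of results in both directions, and the function name are assumptions for the example, not something the cards prescribe.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_permutations, rng):
    observed = statistics.mean(group_b) - statistics.mean(group_a)
    # Combine the samples, then repeatedly reallocate them at random.
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a])
        # Count chance results at least as extreme, in either direction.
        if abs(diff) >= abs(observed):
            extreme += 1
    # p-value: share of chance results as extreme as the observed one.
    return observed, extreme / n_permutations

rng = random.Random(3)
control = [1, 2, 3, 4, 5, 6, 7, 8]
treatment = [11, 12, 13, 14, 15, 16, 17, 18]
observed, p_value = permutation_test(control, treatment, 1000, rng)
```

With groups this far apart almost no random reallocation matches the observed difference of 10, so the p-value is very small and would fall below a typical alpha such as 0.05.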
Alpha
The probability threshold of "unusualness" that chance results must surpass, for actual outcomes to be deemed statistically significant.
Type 1 error
Mistakenly concluding an effect is real (when it is due to chance).
Type 2 error
Mistakenly concluding an effect is due to chance (when it is real).
t-statistic
A standardized version of the test statistic.
t-distribution
A reference distribution to which the observed t-statistic can be compared.
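One common t-statistic compares two group means by dividing their difference by its standard error. The sketch below uses Welch's form (separate variances per group); that choice, the function name, and the data are assumptions, since the cards do not specify a variant.

```python
import math
import statistics

def welch_t(a, b):
    # Standard error of the difference in means, using each group's variance.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    # Standardized difference in means.
    return (statistics.mean(a) - statistics.mean(b)) / se

group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat = welch_t(group_a, group_b)   # (3 - 4) / 1.0 = -1.0
```

The result would then be compared against a t-distribution with the appropriate degrees of freedom.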
False discovery rate
Across multiple tests, the rate of making a Type 1 error.
Adjustment of p-values
Accounting for doing multiple tests on the same data.
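As one concrete example of adjusting p-values, the Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1. The cards do not name this method; it is shown here only as a simple, well-known instance, and the p-values are invented.

```python
def bonferroni(p_values):
    # Multiply each p-value by the number of tests, capped at 1.0.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.01, 0.04, 0.5]
adjusted = bonferroni(raw)   # approximately [0.03, 0.12, 1.0]

alpha = 0.05
significant = [p <= alpha for p in adjusted]   # [True, False, False]
```

The second test looks significant on its own (0.04), but not after the adjustment.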
Overfitting
Fitting the noise.
d.f.
Degrees of freedom.
Pairwise comparison
A hypothesis test between two groups among multiple groups.
Omnibus test
A single hypothesis test of the overall variance among multiple group means.
Decomposition of variance
Separation of components contributing to an individual value.
F-statistic
A standardized statistic that measures the extent to which differences among group means exceed what might be expected in a chance model.
Sum of squares
The sum of squared deviations from some average value.
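The decomposition of variance behind the F-statistic can be written out for a one-way layout: the sum of squares between groups and the sum of squares within groups are each divided by their degrees of freedom, and the F-statistic is their ratio. The function name and data are illustrative.

```python
def one_way_f(groups):
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Sum of squares between groups: group means vs. the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Sum of squares within groups: values vs. their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_f(groups)   # (6 / 2) / (6 / 6) = 3.0
```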