Introduction to Structured Data and Statistical Concepts

Continuous

Data that can take on any value in an interval.

Discrete

Data that can take on only integer values, such as counts.

Categorical

Data that can take on only a specific set of values representing a set of possible categories.

Binary

A special case of categorical data with just two categories of values (0/1, true/false).

Ordinal

Categorical data that has an explicit ordering.

Data frame

Rectangular data (like a spreadsheet).

Feature

A column in the table.

Outcome

A variable being predicted.

Record

A row in the table.

Mean

The sum of all values divided by the number of values.

Weighted mean

The sum of each value multiplied by its weight, divided by the sum of the weights.

Median

The value such that one-half of the data lies above it and one-half lies below it.

Weighted median

The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it.
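
A minimal sketch of the weighted median, assuming one common convention: sort the values, accumulate the weights, and return the first value at which the running weight reaches half of the total. The function name and the sample numbers are illustrative only, and other tie-breaking conventions exist.

```python
def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))  # sort values, keeping weights attached
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("values and weights must be non-empty")


# Hypothetical data: the heavy weight on 4 pulls the weighted median upward.
heavy = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])
equal = weighted_median([1, 2, 3], [1, 1, 1])
```

With equal weights this reduces to the ordinary median for an odd number of values.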

Trimmed mean

The average of all values after dropping a fixed number of extreme values from each end of the sorted data.
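
The location estimates defined so far (mean, weighted mean, median, trimmed mean) can be compared on a small example. This is a sketch using Python's standard library; the helper names, the data, and the weights are assumptions chosen for illustration.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k would trim away all of the data")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


data = [2, 3, 3, 4, 5, 6, 100]   # hypothetical values; 100 is an outlier
weights = [1, 1, 1, 1, 1, 1, 0]  # hypothetical weights that ignore the outlier

plain = mean(data)                       # pulled upward by the outlier
middle = median(data)                    # unaffected by the outlier
trimmed = trimmed_mean(data, 1)          # drops 2 and 100
weighted = weighted_mean(data, weights)  # outlier has zero weight
```

The median and the trimmed mean stay near the bulk of the data, while the plain mean is dragged toward the outlier.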

Robust

Not sensitive to extreme values.

Outlier

A data value that is very different from most of the data.

Deviations

The difference between observed values and the estimate of location.

Variance

The sum of squared deviations from the mean divided by n - 1, where n is the number of data values.

Standard deviation

The square root of the variance.

Mean absolute deviation

The mean of the absolute value of the deviations from the mean.

Median absolute deviation from the median

The median of the absolute value of the deviations from the median.
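
The variability estimates above can be sketched with the standard library. `statistics.variance` and `statistics.stdev` use the n - 1 denominator; the two absolute-deviation helpers and the data are illustrative assumptions.

```python
from statistics import mean, median, stdev, variance


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


data = [1, 2, 3, 4, 100]  # hypothetical values; 100 is an outlier

var = variance(data)                         # sum of squared deviations / (n - 1)
sd = stdev(data)                             # square root of the variance
mean_ad = mean_absolute_deviation(data)
median_ad = median_absolute_deviation(data)  # robust: barely moved by the outlier
```

On this data the variance and standard deviation are dominated by the single outlier, while the median absolute deviation is not.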

Range

The difference between the largest and the smallest value in a data set.

Percentile

The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more.

Interquartile range

The difference between the 75th percentile and the 25th percentile.
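
A short sketch of the range, quartiles, and interquartile range using `statistics.quantiles` (Python 3.8+). Software packages interpolate percentiles in slightly different ways, so the exact cut points depend on the method; the "inclusive" method and the data here are assumptions for illustration.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical values

# n=4 splits the data into quartiles and returns the three cut points
# (25th, 50th, and 75th percentiles).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

data_range = max(data) - min(data)  # largest value minus smallest value
iqr = q3 - q1                       # 75th percentile minus 25th percentile
```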

Boxplot

A plot introduced by Tukey as a quick way to visualize the distribution of data.

Frequency table

A tally of the count of numeric data values that fall into a set of intervals (bins).
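
One way to build a frequency table is to cut the range of the data into equal-width bins and count the values in each. The sketch below assumes equal-width bins and places the maximum value in the last bin; the function name and data are hypothetical.

```python
from collections import Counter


def frequency_table(values, n_bins):
    """Return a list of (bin_start, bin_end, count) for equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; bins would have zero width")
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        # The maximum value falls into the last bin rather than opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(n_bins)]


data = [1, 2, 2, 3, 5, 8, 9, 10]  # hypothetical values
table = frequency_table(data, 3)
bin_counts = [count for _, _, count in table]
```

Plotting `bin_counts` against the bins gives the histogram described in the next card.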

Histogram

A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.

Density plot

A smoothed version of the histogram, often based on a kernel density estimate.

Mode

The most commonly occurring category or value in a data set.

Expected value

When the categories can be associated with a numeric value, this gives an average value based on a category's probability of occurrence.
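
A small sketch of the mode and the expected value. The expected value is each category's numeric value multiplied by its probability, summed over categories. The helper names, the categories, and the probabilities are hypothetical.

```python
from collections import Counter


def mode(values):
    """Most commonly occurring value; if several tie, one of them is returned."""
    return Counter(values).most_common(1)[0][0]


def expected_value(outcomes):
    """outcomes maps a numeric value to its probability of occurrence."""
    if abs(sum(outcomes.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * prob for value, prob in outcomes.items())


most_common_category = mode(["a", "b", "a", "c"])

# Hypothetical revenue per customer: 5% pay 300, 15% pay 50, 80% pay nothing.
ev = expected_value({300: 0.05, 50: 0.15, 0: 0.80})
```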

Bar charts

The frequency or proportion for each category plotted as bars.

Pie charts

The frequency or proportion for each category plotted as wedges in a pie.

Correlation coefficient

A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).
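
The most common correlation coefficient is Pearson's, which the sketch below computes from deviations about the means. The card does not name a specific coefficient, so treating it as Pearson's is an assumption; the function name and data are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: co-deviation of x and y, scaled to lie in [-1, +1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4]                         # hypothetical values
perfect_pos = pearson_r(x, [2, 4, 6, 8])  # y rises exactly with x
perfect_neg = pearson_r(x, [8, 6, 4, 2])  # y falls exactly as x rises
```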

Correlation matrix

A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.

Scatterplot

A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

Contingency tables

A tally of counts between two or more categorical variables.
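
A contingency table for two categorical variables can be sketched by counting each pair of categories. The records below (a loan grade and a loan status) are hypothetical.

```python
from collections import Counter

# Hypothetical records: (grade, status)
records = [
    ("A", "paid"), ("A", "late"), ("B", "paid"),
    ("A", "paid"), ("B", "late"), ("B", "paid"),
]

counts = Counter(records)  # tally of each (grade, status) pair
grades = sorted({grade for grade, _ in records})
statuses = sorted({status for _, status in records})

# Rows are grades, columns are statuses, cells are counts.
table = {g: {s: counts[(g, s)] for s in statuses} for g in grades}
```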

Hexagonal binning

A plot of two numeric variables with the records binned into hexagons.

Contour plots

A plot showing the density of two numeric variables like a topographical map.

Violin plots

Similar to a boxplot but showing the density estimate.

Sample

A subset from a larger data set.

Population

The larger data set or idea of a data set.

N (n)

The size of the population (sample).

Random sampling

Drawing elements into a sample at random.

Stratified sampling

Dividing the population into strata and randomly sampling from each stratum.

Simple random sample

The sample that results from random sampling without stratifying the population.
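
The difference between a simple random sample and a stratified sample can be sketched as follows. The population, the stratum key, and the sample sizes are hypothetical; `random.Random.sample` draws without replacement.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, rng):
    """Group records into strata by key, then sample randomly within each stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical population: 90 records from "north", 10 from "south".
population = [("north", i) for i in range(90)] + [("south", i) for i in range(10)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

simple = rng.sample(population, 10)  # may contain few or no "south" records
stratified = stratified_sample(population, key=lambda r: r[0], per_stratum=5, rng=rng)
```

The stratified sample always contains five records from each region, whereas the simple random sample reflects the 90/10 split only on average.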

Sample bias

A sample that misrepresents the population.

Sample statistic

A metric calculated for a sample of data.

Data distribution

The frequency distribution of individual values in a data set.

Sampling distribution

The frequency distribution of a sample statistic over many samples.

Central limit theorem

The tendency of the sampling distribution to take on a normal shape as sample size rises.

Standard error

The variability of a sample statistic over many samples.

Bootstrap

A method to estimate the sampling distribution by drawing additional samples with replacement from the sample itself.

Bootstrap sample

A sample taken with replacement from an observed data set.

Resampling

The process of taking repeated samples from observed data.
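
A minimal sketch of the bootstrap: draw many samples with replacement from the observed sample, compute the statistic on each, and use the spread of those values as an estimate of the standard error. The function name, the data, and the number of resamples are assumptions.

```python
import random
from statistics import mean, stdev


def bootstrap_statistics(sample, statistic, n_resamples, rng):
    """Compute the statistic on n_resamples bootstrap samples of the same size."""
    n = len(sample)
    # rng.choices draws with replacement, so values can repeat within a resample.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]


sample = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25]  # hypothetical observed data
rng = random.Random(0)                          # fixed seed for repeatability

boot_means = bootstrap_statistics(sample, mean, 1000, rng)
standard_error = stdev(boot_means)  # variability of the mean across resamples
```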

Confidence level

The percentage of confidence intervals, constructed in the same way from the same population, that are expected to contain the statistic of interest.

Interval endpoints

The top and bottom of the confidence interval.
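
One common way to obtain interval endpoints is the percentile bootstrap: sort the bootstrap statistics and trim an equal share from each end. This sketch assumes a 90% confidence level and a hypothetical data set; other interval constructions exist.

```python
import random
from statistics import mean


def bootstrap_ci(sample, statistic, level=0.90, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval: returns (lower, upper) endpoints."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2  # share trimmed from each end
    lower = stats[round(tail * n_resamples)]
    upper = stats[round((1 - tail) * n_resamples) - 1]
    return lower, upper


sample = [3, 5, 7, 8, 12, 13, 14, 18, 21, 25]  # hypothetical observed data
lower, upper = bootstrap_ci(sample, mean)
```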

Error

The difference between a data point and a predicted or average value.

Standardize

Subtract the mean and divide by the standard deviation.

z-score

The result of standardizing an individual data point.
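
Standardizing can be sketched in a few lines: subtract the mean from each value and divide by the standard deviation. The function name and data are illustrative, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def standardize(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    center, spread = mean(values), stdev(values)
    return [(v - center) / spread for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
z_scores = standardize(data)
```

After standardizing, the z-scores have mean 0 and standard deviation 1, though their distribution keeps the shape of the original data.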

Standard normal

A normal distribution with mean = 0 and standard deviation = 1.

QQ-Plot

A plot to visualize how close a sample distribution is to a normal distribution.

Tail

The long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency.

Skew

Where one tail of a distribution is longer than the other.

n

Sample size.

Degrees of freedom

A parameter that allows the t-distribution to adjust to different sample sizes.

Trial

An event with a discrete outcome (e.g., a coin flip).

Success

The outcome of interest for a trial.

Binomial trial

A trial with two outcomes.

Binomial distribution

The distribution of the number of successes in a fixed number of trials, each with the same probability of success.
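
The binomial probability of exactly k successes in n trials, each with success probability p, is C(n, k) * p^k * (1 - p)^(n - k). A sketch using `math.comb` (Python 3.8+); the symbols n, k, p and the example numbers are assumptions for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Hypothetical example: 5 trials, 10% chance of success on each trial.
exactly_two = binomial_pmf(2, 5, 0.1)
total = sum(binomial_pmf(k, 5, 0.1) for k in range(6))  # all outcomes together
```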

Lambda

The rate (per unit of time or space) at which events occur.

Poisson distribution

The frequency distribution of the number of events in sampled units of time or space.

Exponential distribution

The frequency distribution of the time or distance from one event to the next.
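
The Poisson and exponential distributions share the rate lambda: one counts events per unit, the other measures the gap between events. The sketch below uses the standard formulas; the function names and the rate of 2 events per unit are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events in one unit when events occur at rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


def exponential_cdf(t, lam):
    """Probability that the wait until the next event is at most t."""
    return 1 - exp(-lam * t)


lam = 2.0  # hypothetical rate: 2 events per unit of time
no_events = poisson_pmf(0, lam)         # no events in one unit
wait_under_one = exponential_cdf(1.0, lam)  # next event arrives within one unit
```

The two views agree: the chance of no events in one unit equals the chance that the wait exceeds one unit, so `no_events` equals `1 - wait_under_one`.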

Weibull distribution

A generalized version of the exponential, in which the event rate is allowed to change over time.

Treatment

Something to which a subject is exposed.

Treatment group

A group of subjects exposed to a specific treatment.

Control group

A group of subjects exposed to no (or standard) treatment.

Randomization

The process of randomly assigning subjects to treatments.

Subjects

The items that are exposed to treatments.

Test statistic

The metric used to measure the effect of the treatment.

Null hypothesis

The hypothesis that chance is to blame.

Alternative hypothesis

Counterpoint to the null (what you hope to prove).

One-way test

Hypothesis test that counts chance results only in one direction (also called a one-tail test).

Two-way test

Hypothesis test that counts chance results in two directions (also called a two-tail test).

Permutation test

The procedure of combining two or more samples together, and randomly reallocating the observations to resamples.

With or without replacement

In sampling, whether or not an item is returned to the sample before the next draw.

P-value

Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results at least as unusual or extreme as the observed results.
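
A permutation test and its p-value can be sketched together. The groups are pooled, shuffled (sampling without replacement), and split again; the p-value is the share of shuffles whose difference in means is at least as extreme as the observed one. This sketch assumes a two-way test on the difference in means; the function name, group data, and number of permutations are hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Return (observed difference in means, estimated two-way p-value)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reallocation of observations to the two groups
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):  # count results in both directions
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical treatment and control measurements.
observed, p_value = permutation_test([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
```

Comparing `p_value` with a chosen alpha (for example 0.05) decides whether the observed difference is deemed statistically significant.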

Alpha

The probability threshold of "unusualness" that chance results must surpass for actual outcomes to be deemed statistically significant.

Type 1 error

Mistakenly concluding an effect is real (when it is due to chance).

Type 2 error

Mistakenly concluding an effect is due to chance (when it is real).

t-statistic

A standardized version of the test statistic.

t-distribution

A reference distribution to which the observed t-statistic can be compared.

False discovery rate

Across multiple tests, the rate of making a Type 1 error.

Adjustment of p-values

Accounting for doing multiple tests on the same data.
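
One simple adjustment for multiple tests is the Bonferroni correction, which multiplies each p-value by the number of tests and caps the result at 1. The card does not name a method, so Bonferroni is an assumption here; the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


raw = [0.01, 0.04, 0.5]  # hypothetical p-values from three tests on the same data
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]  # compare against alpha = 0.05
```

Here the second test is significant before adjustment but not after it.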

Overfitting

Fitting the noise.

d.f.

Degrees of freedom.

Pairwise comparison

A hypothesis test between two groups among multiple groups.

Omnibus test

A single hypothesis test of the overall variance among multiple group means.

Decomposition of variance

Separation of components contributing to an individual value.

F-statistic

A standardized statistic that measures the extent to which differences among group means exceed what might be expected in a chance model.

Sum of squares

The sum of squared deviations from some average value.
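
The decomposition of variance, sums of squares, degrees of freedom, and F-statistic fit together in a one-way analysis of variance. The sketch below assumes that setting: the total sum of squares splits into a between-group part and a within-group part, and F is the ratio of their mean squares. The function name and the groups are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F-statistic, between-group SS, within-group SS) for a list of groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1               # d.f. for the group means
    df_within = len(all_values) - len(groups)  # d.f. for the residuals
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between, ss_within


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical measurements
f_stat, ss_between, ss_within = one_way_f(groups)

# Total sum of squares around the grand mean, for checking the decomposition.
values = [v for g in groups for v in g]
ss_total = sum((v - mean(values)) ** 2 for v in values)
```

A large F-statistic means the group means differ by more than the within-group variation would suggest under a chance model.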