Data
Collections of observations, such as measurements, genders, or survey responses
Statistics
The science of planning studies and experiments; obtaining data; and then organizing, summarizing, presenting, analyzing, and interpreting those data and drawing conclusions based on them
Population
the complete collection of all measurements or data that are being considered
Census
the collection of data from every member of the population
Sample
Subcollection of members selected from a population
Voluntary Response Sample
one in which the respondents themselves decide whether to be included
Parameter
a numerical measurement describing some characteristic of a population
Statistic
a numerical measurement describing some characteristic of a sample
Quantitative Data
Data consisting of numbers representing counts or measurements
Qualitative (or Categorical) Data
Data consisting of names or labels (not numbers that represent counts or measurements)
Discrete Data
result when the data values are quantitative and the number of values is finite or "countable"
Continuous Data
result from infinitely many possible quantitative values, where the collection of values is not countable
Nominal Level of Measurement
characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme (such as low to high)
Ordinal Level of Measurement
data that can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless
Interval Level of Measurement
Data that can be arranged in order, and differences between data values can be found and are meaningful. Data at the interval level do NOT have a natural zero starting point at which none of the quantity is present.
Ratio Level of Measurement
Data that can be arranged in order, differences can be found and are meaningful, and there IS a natural zero starting point
Big Data
Data sets that are so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software running simultaneously in parallel on many different computers
Data Science
Involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as sociology or finance).
Missing Completely at Random
A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value.
Missing Not at Random
A data value is missing not at random if the missing value is related to the reason that it is missing.
Placebo
A harmless and ineffective pill, medicine, or procedure sometimes used for psychological benefit or sometimes used by researchers for comparison to other treatments
Experiment
we apply some treatment and then observe its effects on the individuals. (These individuals are called experimental units, or subjects when they are people.)
Observational Study
observe and measure specific characteristics, but we don't attempt to modify the individuals being studied
Replication
Repetition of an experiment on more than one individual
Blinding
Used when the subject doesn't know whether he or she is receiving a treatment or a placebo
Placebo Effect
occurs when an untreated subject reports an improvement in symptoms
Randomization
Used when individuals are assigned to different groups through a process of random selection
Double Blinding
the act of blinding both the subjects of an experiment and the researchers who work with the subjects.
Confounding
occurs when we can see some effect, but we cannot identify the specific factor that caused it.
Simple Random Sample
A sample of size n selected from the population in such a way that each possible sample of size n has an equal chance of being selected.
Random Sample
has a weaker requirement (as compared to a simple random sample) that all members of the population have the same chance of being selected
Systematic Sampling
we select some starting point and then select every kth (such as every 50th) element in the population
Convenience Sampling
we simply use data that are very easy to get
Stratified Sampling
we subdivide the population into at least two different subgroups (or strata) so that subjects within the same subgroup share the same characteristics (such as gender). Then we draw a sample from each subgroup
Cluster Sampling
we first divide the population area into sections (or clusters). Then we randomly select some of those clusters and choose all the members from those selected clusters.
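The four sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 ID numbers, the two strata, the ten clusters, and all function names are hypothetical, and a fixed seed is used so that the run is repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    # Every possible sample of size n has the same chance of being chosen.
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    # Pick a random starting point among the first k elements,
    # then select every kth element after it.
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # strata maps a subgroup label to the list of its members;
    # a separate random sample is drawn from each subgroup.
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Randomly select whole clusters and keep ALL members of each one.
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(0)              # fixed seed (assumption, for repeatability)
population = list(range(1, 101))    # hypothetical population of 100 ID numbers

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 20, rng)
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster = cluster_sample(clusters, 2, rng)
```

Note the difference between the last two: stratified sampling takes some members from every subgroup, while cluster sampling takes every member from some subgroups.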
Cross-Sectional Study
data are observed, measured, and collected at one point in time
Retrospective Study
data are collected from a past time period by going back in time (through examination of records, interviews, and so on)
Prospective (or Longitudinal) Study
data are collected in the future from groups that share common factors
Sampling Error
occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; such an error results from chance sample fluctuations
Non-Sampling Error
the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances
Nonrandom Sampling Error
the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample
Statistically significant result
one that is very unlikely to occur by chance
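Lower Class Limits
the smallest numbers that can belong to each class (the beginning value of each class)
Upper Class Limits
the largest numbers that can belong to each class (the end value of each class)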
Class Boundaries
the numbers used to separate the classes, but without the gaps created by class limits. (They fall halfway between adjacent classes. Example: for the class 10 - 19, with neighboring classes 0 - 9 and 20 - 29, the boundaries are 9.5 and 19.5.)
Class Midpoint
the values in the middle of the classes. Class midpoint = (Lower Class Limit + Upper Class Limit) / 2
Class Width
the difference between two consecutive lower class limits
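As a small worked example of class boundaries, midpoints, and width, here is a sketch that assumes three hypothetical classes (10 - 19, 20 - 29, 30 - 39) with whole-number limits. The variable names are illustrative only.

```python
# Each class is written as (lower class limit, upper class limit).
classes = [(10, 19), (20, 29), (30, 39)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = classes[1][0] - classes[0][1]

# Boundaries split each gap in half, so adjacent classes share a boundary.
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in classes]

# Midpoint = (lower class limit + upper class limit) / 2.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Width = difference between two consecutive lower class limits.
width = classes[1][0] - classes[0][0]
```

For the class 10 - 19 this gives boundaries 9.5 and 19.5, a midpoint of 14.5, and a class width of 10 (not 9, which is what subtracting the limits of a single class would give).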
Frequency Table (Distribution)
shows how data are partitioned among several categories (or classes) by listing the categories along with the number (frequency) of data values in each of them.
Relative Frequency Distribution
a variation of the basic frequency distribution in which each class frequency is replaced by a relative frequency (a proportion or a percentage) instead of an actual frequency count.
Cumulative Frequency Distribution
A variation of the basic frequency distribution, in which the frequency for each class is the sum of the frequencies for that class and all previous classes.
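The three kinds of frequency distribution can be compared side by side on one small data set. The ten data values and the three classes below are made up for illustration.

```python
from itertools import accumulate

data = [12, 15, 21, 22, 25, 28, 31, 33, 34, 38]   # hypothetical values
classes = [(10, 19), (20, 29), (30, 39)]           # (lower limit, upper limit)

# Frequency: how many data values fall in each class.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in classes]

# Relative frequency: class frequency divided by the total number of values.
relative = [f / len(data) for f in frequencies]

# Cumulative frequency: running total of this class and all previous classes.
cumulative = list(accumulate(frequencies))
```

The relative frequencies always add up to 1 (or 100%), and the last cumulative frequency always equals the number of data values.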
Histogram
A graph used to show the frequency distribution of one quantitative variable. (A bar graph in which adjacent bars touch; each bar spans the boundaries of its class.)
Relative Frequency Histogram
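A histogram with the same shape and horizontal scale as an ordinary histogram, but whose vertical scale uses relative frequencies (percentages or proportions) instead of frequency counts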
Normal Distribution
a distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. (Bell Shaped)
Skewed Right Distribution
a distribution that is not symmetrical and extends to one side more than to the other. The tail is on the right side
Skewed Left Distribution
a distribution that is not symmetrical and extends to one side more than to the other. The tail is on the left side.
Uniform Distribution
a type of distribution in which all different possible values occur with approximately the same frequency, so the heights of the bars in the histogram are approximately uniform
Dotplot
a graph of quantitative data in which each data value is plotted as a point (or dot) above a horizontal scale of values. Dots representing equal values are stacked.
Stem-and-Leaf Plot
represents quantitative data by separating each value into two parts: the stem (such as the leftmost digit, the tens) and the leaf (such as the rightmost digit, the ones). The original data values can be reconstructed from the plot.
Time-Series Graph
a graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.
Bar Graph
uses bars of equal width to show frequencies of categories of categorical (or qualitative) data. Typically has spaces between bars
Pareto Chart
a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right. (NO spaces between bars)
Pie Chart
a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category.
Frequency Polygon
uses line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars.
Relative Frequency Polygon
uses line segments connected to points located directly above class midpoint values but uses relative frequencies (proportions or percentages) for the vertical scale instead.
Pictographs
Drawings of objects. Data that are one-dimensional in nature (such as budget amounts) are often depicted with two-dimensional objects (such as dollar bills) or three-dimensional objects (such as stacks of dollar bills). Pictographs can create false impressions that grossly distort differences, because of basic geometry: doubling both the height and the width of a two-dimensional drawing multiplies its area by four, not two.
Correlation
a relationship that exists between two variables when the values of one variable are somehow associated with the values of the other variable.
Linear Correlation
exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line.
Scatter Plot
is a plot of paired (x, y) quantitative data with a horizontal x-axis and a vertical y-axis. The horizontal axis is used for the first variable (x), and the vertical axis is used for the second variable (y).
Linear Correlation Coefficient
is denoted by r, and it measures the strength of the linear association between two variables.
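One common way to compute r is to divide the sum of the products of the deviations from the means by the square root of the product of the sums of squared deviations. The sketch below applies that formula to five made-up (x, y) pairs; the function name is hypothetical.

```python
from math import sqrt

def linear_correlation(xs, ys):
    # r = sum((x - x_bar)(y - y_bar)) / sqrt(sum((x - x_bar)^2) * sum((y - y_bar)^2))
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]   # hypothetical paired data
y = [2, 4, 5, 4, 5]
r = linear_correlation(x, y)   # about 0.775
```

The value of r always lies between -1 and 1. Values near -1 or 1 indicate a strong linear association, and values near 0 indicate little or no linear association.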
P-Value
is the probability, assuming there really is no linear correlation between the two variables, of getting paired sample data with a linear correlation coefficient r that is at least as extreme as the one obtained from the paired sample data.
Regression Line
is the straight line that "best" fits the scatterplot of the data.
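"Best" is usually taken to mean least squares: the line that minimizes the sum of the squared vertical distances between the points and the line. Under that assumption, the slope and intercept can be computed as in this sketch, which uses five made-up (x, y) pairs and a hypothetical function name.

```python
def least_squares_line(xs, ys):
    # slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    # intercept = y_bar - slope * x_bar
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

x = [1, 2, 3, 4, 5]   # hypothetical paired data
y = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(x, y)

# Predicted y for a given x, using the fitted line y = intercept + slope * x.
predicted_at_3 = intercept + slope * 3
```

For these values the line is approximately y = 2.2 + 0.6x. A least-squares line always passes through the point (mean of x, mean of y), which here is (3, 4).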
Descriptive Statistics
summarize or describe relevant characteristics of data
Inferential Statistics
used to make inferences or generalizations about a population
Measure of Center
a value at the center or middle of a data set; common measures are the mean, median, mode, and midrange
Mean (or Arithmetic Mean)
of a set of data is the measure of center found by adding all of the data values and dividing the total by the number of data values. Also known as the average
Resistant
a statistic is resistant if the presence of extreme values (outliers) does not cause it to change very much
Median
of a data set is the measure of center that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
Mode
of a data set is the value(s) that occur(s) with the greatest frequency.
Bimodal
When two data values occur with the same greatest frequency, each one is a mode
Multimodal
When more than two data values occur with the same greatest frequency, each is a mode
No mode
When no data value is repeated
Midrange
of a data set is the measure of center that is the value midway between the maximum and minimum values in the original data set. It is found by adding the maximum data value to the minimum data value and then dividing the sum by 2
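The four measures of center can be computed with Python's standard `statistics` module, as in the sketch below. The data values are made up, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]   # hypothetical data set

mean = statistics.mean(data)               # sum of values / number of values
median = statistics.median(data)           # middle value of the sorted data
modes = statistics.multimode(data)         # list of the most frequent value(s)
midrange = (max(data) + min(data)) / 2     # midway between maximum and minimum

# Two values tied for the greatest frequency: the data set is bimodal.
bimodal_modes = statistics.multimode([1, 1, 2, 2, 3])

# Resistance: adding one outlier moves the mean a lot, the median only a little.
with_outlier = data + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)
```

One caveat: when no value is repeated, `multimode` returns every value, whereas the convention in this glossary is to say that such a data set has no mode.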
Variation
Describes the spread of data by finding values of range, variance, and standard deviation
Range
of a set of data values is the difference between the maximum data value and the minimum data value.
Standard Deviation
a measure of how much data values deviate from the mean. Denoted s for a sample and σ for a population.
Biased Estimator
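a statistic whose values do not tend to center around the parameter it estimates. The sample standard deviation s is a biased estimator of σ, which means that values of s do not tend to center around the value of the population standard deviation σ.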
Unbiased Estimator
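a statistic whose values tend to center around the parameter it estimates. The sample variance s^2 is an unbiased estimator of σ^2, which means that values of s^2 tend to center around the value of σ^2 instead of systematically tending to overestimate or underestimate σ^2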
Range Rule of Thumb
Subtract the smallest value in a dataset from the largest and divide the result by four to estimate the standard deviation.
Variance
of a set of values is a measure of variation equal to the square of the standard deviation.
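The measures of variation above can be compared on one small data set. This sketch uses the standard `statistics` module, where `stdev` and `variance` divide by n - 1 (sample formulas) and `pstdev` and `pvariance` divide by n (population formulas). The data values are made up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data set, mean = 5

data_range = max(data) - min(data)          # maximum minus minimum

sample_variance = statistics.variance(data)       # s^2, divides by n - 1
sample_std = statistics.stdev(data)                # s = sqrt(s^2)

population_variance = statistics.pvariance(data)   # sigma^2, divides by n
population_std = statistics.pstdev(data)           # sigma = sqrt(sigma^2)

# Range rule of thumb: a rough estimate of the standard deviation.
range_rule_estimate = data_range / 4
```

Here the range is 7, so the range rule of thumb estimates a standard deviation of 1.75, compared with a computed sample standard deviation of about 2.14. The rule gives only a rough estimate.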
Coefficient of Variation (or CV)
for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean
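For sample data, the coefficient of variation is CV = (s / mean) x 100%. A minimal sketch with made-up data and a hypothetical function name:

```python
import statistics

def coefficient_of_variation(values):
    # Sample standard deviation expressed as a percent of the sample mean.
    return statistics.stdev(values) / statistics.mean(values) * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical sample
cv = coefficient_of_variation(data)      # about 42.8 (percent)
```

Because the CV has no units, it can be used to compare the variation of data sets measured on different scales, such as heights in centimeters and weights in kilograms.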
Z-Score (or standard score or standardized value)
is the number of standard deviations that a given value x is above or below the mean
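In symbols, z = (x - mean) / standard deviation. The sketch below uses made-up numbers (a mean of 70 and a standard deviation of 10); the function name is hypothetical.

```python
def z_score(x, mean, std_dev):
    # Number of standard deviations that x lies above (+) or below (-) the mean.
    return (x - mean) / std_dev

z_above = z_score(85, 70, 10)   # 1.5 standard deviations above the mean
z_below = z_score(55, 70, 10)   # 1.5 standard deviations below the mean
z_at_mean = z_score(70, 70, 10) # a value equal to the mean has z = 0
```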
Percentile
are measures of location, denoted P1, P2, ..., P99, which divide a set of data into 100 groups with about 1% of the values in each group
Quartiles
are measures of location, denoted Q1, Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group.
Boxplot (or box-and-whisker diagram)
is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile Q1, the median, and the third quartile Q3
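The five values a boxplot is drawn from (minimum, Q1, median, Q3, maximum) are often called the 5-number summary. The sketch below computes them with `statistics.quantiles` (Python 3.8 or later). The data are made up, and the "inclusive" method is only one of several quartile conventions; textbooks and software packages can give slightly different quartiles for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data set

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Values needed to draw the boxplot: the line runs from minimum to maximum,
# and the box has lines at Q1, the median (Q2), and Q3.
five_number_summary = (min(data), q1, q2, q3, max(data))
```

For this data set the summary is 1, 3, 5, 7, 9, and Q2 is the same value as the median.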
Skewed
if the spread of data is not symmetric and extends more to one side than to the other.