What Can We Learn From Data?
Any information you learn from sample data is called a statistic, whereas any information you learn from an entire population is called a parameter
We collect data from individuals (which can be anything, not just a person)
Variables are any characteristics that can change from individual to individual
Two types of variables: categorical [takes on values that are category names or group labels + usually characterized by a word or phrase; ex: eye color, ethnicity, age group, fondness for apples] or quantitative [takes on a numerical value that is measured or counted + usually characterized by a number; ex: weight, how many candies are in a bag]
Representing a categorical variable with tables
Categorical data can be organized by tables that include the various categories of the study, frequency (the number of each category in the sample, ex: 15 trees, 17 trees, 23 trees), and relative frequency (the proportion of each category in the sample, ex: 0.258 of the trees, 0.360 of the trees, 0.191 of the trees)
Relative frequencies (proportions) usually represent data better than raw frequencies, especially when comparing groups of different sizes
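As a rough illustration of the table described above, here is a minimal Python sketch that builds frequencies and relative frequencies for a categorical variable. The tree species data is made up for the example and is not from these notes.

```python
from collections import Counter

# Hypothetical sample: species of 10 trees (made-up data for illustration)
trees = ["oak", "pine", "oak", "maple", "pine",
         "oak", "oak", "maple", "pine", "oak"]

counts = Counter(trees)   # frequency of each category
n = len(trees)            # sample size

# Each row is (category, frequency, relative frequency)
table = [(species, freq, freq / n) for species, freq in counts.most_common()]

for species, freq, rel in table:
    print(f"{species:<6} {freq:>3} {rel:.3f}")
```

The relative frequencies always add up to 1, which is a quick way to check a table made by hand.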
Two options for graphing categorical data: bar graphs [which can display either the frequencies or the relative frequencies of a data set] and pie/circle graphs [which display each category as a slice proportional to its share of the whole]
Distribution of data is what values the data takes on and how often
Best way to talk about distribution of data is often to compare two data samples
Representing a quantitative variable with tables
Two types of quantitative data: discrete [takes on a countable number of values, usually finite and usually whole numbers; ex: number of goals, number of candies, number of shirts] and continuous [can take on any value in an interval, so there are infinitely many possible values that cannot be counted; usually recorded as decimals; ex: weight of a frog, speed of a car, time to finish a puzzle]
Can be organized into a frequency or relative frequency table
Since there are no categories, the data must be placed into “bins” of intervals that are all equal in size (ex: 10-20, 20-30, 30-40, 40-50, etc)
Basically: how many of our individuals fall within the range of each bin? The “how many” is going to be our frequency
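The binning idea can be sketched in Python as follows. The heights are made-up values, and the rule that a value sitting exactly on a boundary (such as 20) goes into the higher bin is an assumption chosen for this example; the notes do not say which bin gets boundary values.

```python
# Hypothetical tree heights in feet (made-up data)
heights = [12, 18, 21, 25, 29, 33, 34, 38, 41, 47]

bin_width = 10   # every bin has the same width
start = 10       # lower edge of the first bin

def bin_frequencies(values, start, width):
    """Count how many values fall in each bin [lower, lower + width)."""
    freq = {}
    for v in values:
        lower = start + ((v - start) // width) * width
        key = (lower, lower + width)
        freq[key] = freq.get(key, 0) + 1
    return dict(sorted(freq.items()))

freq = bin_frequencies(heights, start, bin_width)
for (low, high), count in freq.items():
    print(f"{low}-{high}: {count} (relative frequency {count / len(heights):.2f})")
```

These bin counts are exactly the bar heights a histogram of the same data would show.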
Four types of graph can be made from quantitative data:
Dot plot
Stem and Leaf plot
Histogram (usually preferred type of graph; NOT the same as a bar graph)
Cumulative graph
Describing the Distribution of a Quantitative Variable
There are four things that have to be mentioned:
Shape – unimodal, bimodal, gap, clusters, skewed right, skewed left, symmetric, asymmetric
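Center – what the typical value is (mean or median)
Spread – how much the data varies (range, IQR, or standard deviation)
Outliers – unusual features, such as values far from the rest of the data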
Example response: The distribution of tree heights is skewed left and unimodal with a center around 110 feet. The heights are spread from 20 to 140 feet, but there is very little spread among the majority of trees, which are 120-140 feet tall
Measures of center
Mean – sum of the data values divided by the number of values there are
Nonresistant (pulled toward outliers and skew)
Median – the middle value
With an odd number of values, it is the exact middle value; with an even number of values, it is the average of the two middle values
Resistant (barely affected by outliers)
Put the data in numerical order first
Roughly symmetric data = roughly equal mean and median
Skewed left = mean is smaller than median
Skewed right = mean is larger than median
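To see why the mean is nonresistant and the median is resistant, here is a small Python sketch using made-up numbers. Adding one large value (which stretches the right tail) pulls the mean up much more than the median.

```python
from statistics import mean, median

# Hypothetical data (made up), roughly symmetric
data = [2, 3, 3, 4, 5, 5, 6]

# Same data with one large value added, which creates right skew
with_outlier = data + [40]

print(mean(data), median(data))                   # mean and median agree
print(mean(with_outlier), median(with_outlier))   # mean jumps, median barely moves
```

In the second data set the mean ends up larger than the median, which matches the rule for right-skewed data.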
Measures of position
Percentile – the pth percentile is the value that has p% of the data less than or equal to it (ex: 25th percentile = the value with 25% of the data at or below it)
First quartile (Q1) is the 25th percentile or median of the lower half of data
Median is 50th percentile
Third quartile (Q3) is the 75th percentile or median of the upper half of data
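The "median of each half" description of Q1 and Q3 can be sketched in Python like this. The function name quartiles and the data are made up for the example, and leaving the overall median out of both halves when the count is odd is one common convention, assumed here; other conventions give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: with an odd number of values, the overall median is
    left out of both the lower and the upper half.
    """
    data = sorted(values)            # data must be in numerical order
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

scores = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # hypothetical data
q1, med, q3 = quartiles(scores)
print(q1, med, q3)
```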
Measures of spread
Range
Max value - min value
Easily influenced by outliers
IQR
Q3 - Q1
Spread of the middle 50% of the data
Not influenced by outliers
Standard deviation
Measures the variability of the distribution: roughly how far typical values are from the mean
High SD means most data is spread far from the mean
Low SD means most data is near the mean
Easily influenced by outliers
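The three measures of spread can be compared side by side with a short Python sketch. The data is made up, spread_summary is a hypothetical helper name, and two assumptions apply: stdev computes the sample standard deviation (dividing by n - 1), and statistics.quantiles interpolates, so its quartiles can differ slightly from ones found by hand.

```python
from statistics import quantiles, stdev

def spread_summary(values):
    """Return (range, IQR, sample standard deviation)."""
    q1, _, q3 = quantiles(values, n=4)   # interpolated quartiles
    return max(values) - min(values), q3 - q1, stdev(values)

data = [4, 5, 6, 6, 7, 8, 9]    # hypothetical data
with_outlier = data + [30]      # same data plus one extreme value

base = spread_summary(data)
shifted = spread_summary(with_outlier)
print(base)
print(shifted)
```

With the extreme value added, the range and standard deviation grow several times over while the IQR changes only a little, which is the sense in which the IQR is not influenced by outliers.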
Outliers
Two methods for determining outliers
Fence method: in which an outlier is a value greater than the upper fence or less than the lower fence
Upper fence: Q3 + (1.5*IQR)
Lower fence: Q1 - (1.5*IQR)
2 Standard Deviation method: an outlier is a value that is located 2 or more standard deviations above or below the mean
x̄ + 2 standard deviations (anything above is outlier)
x̄ - 2 standard deviations (anything below is outlier)
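Both outlier methods can be sketched in Python as below. The function names and data are made up for the example. The quartiles use the median-of-halves convention (overall median excluded when the count is odd), and the standard deviation is the sample version; both are assumptions, since the notes do not specify them.

```python
from statistics import mean, median, stdev

def fence_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[half + 1:] if len(data) % 2 else data[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

def two_sd_outliers(values):
    """Values 2 or more standard deviations above or below the mean."""
    m, s = mean(values), stdev(values)
    return [x for x in values if x <= m - 2 * s or x >= m + 2 * s]

data = [4, 5, 6, 6, 7, 8, 9, 30]   # hypothetical data
print(fence_outliers(data))
print(two_sd_outliers(data))
```

For this data the fences are 1 and 13, and both methods flag only the value 30. The two methods do not always agree, because the 2 standard deviation cutoffs are themselves stretched by the outlier.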
Graphical representation of summary statistics
Five number summary: min, Q1, median, Q3, and max
Can be used to create a box plot to summarize the data
Box plots can also potentially show you the skew of a data set (a longer whisker on the right, with the box sitting toward the left, can indicate right skew, and vice versa)
Comparing Distributions of a Quantitative Variable
Compare shape, center, and spread + interpret them
BE SPECIFIC (don’t just say 35, say 35 trees)
Some sets of data can be modeled with a density curve [a smooth curve fit to the data to give insight into what the population the data represents could look like]
Ex: normal distribution curve
Empirical rule: in normal distributions, about 68% of the population is within 1 standard deviation of the mean, about 95% is within 2 standard deviations of the mean, and about 99.7% is within 3 standard deviations of the mean
The remaining 0.3% beyond 3 standard deviations is rarely needed
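The empirical rule percentages can be checked against the standard normal curve with Python's built-in statistics.NormalDist. This is only a quick verification sketch; the variable names are arbitrary.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # standard normal curve

# Proportion of the curve within k standard deviations of the mean
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.4f}")
```

The results round to 68%, 95%, and 99.7%, so the empirical rule is a rounded version of the exact areas.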
Z score measures how many SDs above or below the mean a value is (positive = above the mean, negative = below the mean)
Formula for z score: Z = (x-μ)/σ
Allows us to compare values from different distributions on the same scale
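A short Python sketch of the z score formula, using made-up test scores, shows how it helps with comparison. The class means and standard deviations here are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical scores: 85 on a math test (mean 75, SD 5)
# and 90 on an English test (mean 82, SD 8)
z_math = z_score(85, 75, 5)
z_english = z_score(90, 82, 8)

print(z_math, z_english)
```

Even though 90 is the higher raw score, the math score is 2 SDs above its mean while the English score is only 1 SD above its mean, so the math score is the stronger result relative to its class.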
Probabilities are written as P(z [<, >, or =] z score); ex: P(z<1.11), P(z>1.11), P(-0.56 < z < 1.11), P(z=1.11) (which is 0, since a single point on a continuous curve has no area)
CALC FUNCTION FOR Z SCORES: 2nd → vars → normalcdf
Lower value: either z score or -99
Upper value: either z score or 99
μ: 0
σ: 1
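For checking calculator answers, the same areas can be computed in Python. The normalcdf function below is a hypothetical helper written to mirror the calculator inputs listed above; it is not a built-in Python function, and it relies on statistics.NormalDist from the standard library.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# The example probabilities from the notes; -99 and 99 stand in for
# "no lower bound" and "no upper bound"
p_below = normalcdf(-99, 1.11)       # P(z < 1.11)
p_above = normalcdf(1.11, 99)        # P(z > 1.11)
p_between = normalcdf(-0.56, 1.11)   # P(-0.56 < z < 1.11)

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```

P(z < 1.11) and P(z > 1.11) add up to 1, since together they cover the whole curve.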
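If given an area to the left (a percentile), you can find the z score or value that it represents through the calc function invNorm
area: the area to the left of the value, as a decimal
To turn a z score back into a data value, plug the known numbers into the z score formula and solve for x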
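Going the other direction, from an area to the left back to a z score and then to a data value, can be sketched with NormalDist.inv_cdf, which plays the same role as invNorm. The mean of 100 and standard deviation of 15 are made-up numbers for the example.

```python
from statistics import NormalDist

standard = NormalDist(0, 1)   # standard normal curve

# z score with 90% of the area to its left (the 90th percentile)
z = standard.inv_cdf(0.90)

# Solve z = (x - mu) / sigma for x, using hypothetical mu and sigma
mu, sigma = 100, 15
x = mu + z * sigma

print(round(z, 4), round(x, 2))
```

The input is always an area between 0 and 1, not a z score, and the output is the z score (or value) at that percentile.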