UNIT 1 | Exploring One Variable Data

What Can We Learn From Data?

  • Any information you learn from sample data is called a statistic, whereas any information you learn from an entire population is called a parameter

  • We collect data from individuals (which can be anything, not just a person)

  • Variables are any characteristics that can change from individual to individual

    • Two types of variables: categorical [takes on values that are category names or group labels + usually characterized by a word or phrase; ex: eye color, ethnicity, age group, fondness of apples] or quantitative [takes on a numerical value that is measured or counted + usually characterized by a number; ex: weight, how many candies are in a bag]

Representing a categorical variable with tables

  • Categorical data can be organized by tables that include the various categories of the study, frequency (the number of each category in the sample, ex: 15 trees, 17 trees, 23 trees), and relative frequency (the proportion of each category in the sample, ex: 0.258 of the trees, 0.360 of the trees, 0.191 of the trees)

    • Relative frequencies (proportions) are usually better than raw frequencies for comparing groups, especially when the groups are different sizes

  • Two options for graphing categorical data: bar graphs [which can display either the frequencies or the relative frequencies of a data set] and pie/circle graphs [which display each slice as a proportion of the whole]

  • Distribution of data is what values the data takes on and how often

    • Best way to talk about distribution of data is often to compare two data samples
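
As a rough sketch of the table described above, the Python below builds a frequency and relative frequency table for a categorical variable. The tree species data and the function name frequency_table are made up for illustration; they are not from the course material.

```python
from collections import Counter

# Hypothetical sample: the species of 10 trees (made-up data for illustration)
trees = ["oak", "pine", "oak", "maple", "pine", "oak", "birch", "oak", "pine", "maple"]

def frequency_table(values):
    """Return {category: (frequency, relative frequency)} for a list of categories."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in counts.items()}

table = frequency_table(trees)

# Print one row per category: name, frequency, relative frequency
for category, (freq, rel_freq) in table.items():
    print(f"{category:<6} {freq:>3} {rel_freq:.3f}")
```

The relative frequencies should always add up to 1, which is a quick way to check a table made by hand.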

Representing a quantitative variable with tables

  • Two types of quantitative data: discrete [takes on a countable number of values, usually finite and usually whole numbers; ex: number of goals, number of candies, number of shirts] and continuous [can take on any value in an interval, so there are infinitely many possible values that cannot be counted; usually measured, often with several decimal places; ex: weight of a frog, speed of a car, time to finish a puzzle]

  • Can be organized into a frequency or relative frequency table

    • Since there are no categories, the data must be placed into “bins” of intervals that are all equal in size (ex: 10-20, 20-30, 30-40, 40-50, etc)

    • Basically: how many of our individuals fall within the range of each bin? The “how many” is going to be our frequency

  • Four types of graph can be made from quantitative data: 

    • Dot plot

    • Stem and Leaf plot

    • Histogram (usually preferred type of graph; NOT the same as a bar graph)

    • Cumulative graph
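
The binning step for a quantitative frequency table can be sketched in Python as follows. The heights are made-up data, the function name binned_frequencies is hypothetical, and the sketch assumes a value that lands exactly on a boundary (like 20) goes into the higher bin; the notes do not say which convention to use.

```python
# Hypothetical tree heights in feet (made-up data for illustration)
heights = [12, 18, 23, 27, 31, 35, 35, 38, 41, 44, 47, 52]

def binned_frequencies(values, start, width, n_bins):
    """Count how many values fall in each equal-width bin.

    Bin i covers start + i*width up to (but not including) start + (i+1)*width.
    That boundary convention is an assumption made for this sketch.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

counts = binned_frequencies(heights, start=10, width=10, n_bins=5)
n = len(heights)

# Print frequency and relative frequency for each bin
for i, count in enumerate(counts):
    low = 10 + 10 * i
    print(f"{low}-{low + 10}: frequency {count}, relative frequency {count / n:.3f}")
```

These bin counts are the bar heights a histogram of the same data would show.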

Describing the Distribution of a Quantitative Variable

  • There are four things that have to be mentioned:

    • Shape – unimodal, bimodal, gap, clusters, skewed right, skewed left, symmetric, asymmetric 

    • Center – what the average value is 

    • Spread – how the data varies

    • Outliers – unusual features

  • Example response: The distribution of tree heights is skewed left and unimodal with a center around 110 feet. The heights are spread from 20 to 140 feet, but there is very little spread among the majority of trees, which are between 120 and 140 feet

  • Measures of center

    • Mean – sum of the data values divided by the number of values there are

      • Nonresistant (easily influenced by outliers)

    • Median – the middle value

      • Put the data in numerical order first

      • With an odd number of values, it is the exact middle value; with an even number of values, it is the average of the two middle-most values

      • Resistant (not easily influenced by outliers)

    • Roughly symmetric data = roughly equal mean and median

    • Skewed left = mean is smaller than median

    • Skewed right = mean is larger than median

  • Measures of position

    • Percentile – the pth percentile is the value that has p% of the data less than or equal to it (ex: 25th percentile = the value with 25% of the data at or below it)

      • First quartile (Q1) is the 25th percentile or median of the lower half of data

      • Median is 50th percentile

      • Third quartile (Q3) is the 75th percentile or median of the upper half of data

  • Measures of spread

    • Range 

      • Max value - min value 

      • Easily influenced by outliers

    • IQR

      • Q3 - Q1

      • Spread of the middle 50% of the data

      • Not influenced by outliers 

    • Standard deviation

      • Measures the variability of the distribution: how far typical values are from the mean

      • High SD means most data is spread far from the mean

      • Low SD means most data is near the mean

      • Easily influenced by outliers

  • Outliers 

    • Two methods for determining outliers

      • Fence method: in which an outlier is a value greater than the upper fence or less than the lower fence 

        • Upper fence: Q3 + (1.5*IQR)

        • Lower fence: Q1 - (1.5*IQR)

      • 2 Standard Deviation method: an outlier is a value that is located 2 or more standard deviations above or below the mean

        • x̄ + 2 standard deviations (anything above is outlier)

        • x̄ - 2 standard deviations (anything below is outlier)
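
To see the measures of center, spread, and both outlier methods working together, here is a minimal Python sketch. The data set is made up, the helper name quartiles is hypothetical, and the quartiles are computed as medians of the lower and upper halves, as in the notes (calculators and software sometimes use slightly different quartile rules). The sample standard deviation (dividing by n - 1) is also an assumption, since the notes do not give an SD formula.

```python
import statistics

# Hypothetical data set (made-up values), with one unusually large value
data = [4, 7, 8, 9, 10, 11, 12, 13, 15, 40]

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd number of values, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

# Measures of center
mean = statistics.mean(data)
median = statistics.median(data)

# Measures of spread
data_range = max(data) - min(data)
q1, q3 = quartiles(data)
iqr = q3 - q1
sd = statistics.stdev(data)  # sample standard deviation (assumption: n - 1 version)

# Fence method: outliers lie below the lower fence or above the upper fence
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
fence_outliers = [x for x in data if x < lower_fence or x > upper_fence]

# 2 standard deviation method: outliers are 2 or more SDs away from the mean
sd_outliers = [x for x in data if abs(x - mean) >= 2 * sd]

print("mean:", mean, "median:", median)
print("range:", data_range, "IQR:", iqr, "SD:", round(sd, 2))
print("fences:", lower_fence, upper_fence, "outliers:", fence_outliers)
print("2 SD outliers:", sd_outliers)
```

In this made-up data set the value 40 pulls the mean above the median, while the median and IQR barely notice it, which illustrates the resistant versus nonresistant distinction.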

Graphical representation of summary statistics

  • Five number summary: min, Q1, median, Q3, and max

  • Can be used to create a box plot to summarize the data

  • Box plots can also potentially show you the skew of a data set (a longer whisker or a more stretched-out half of the box on the right side can indicate right skew, and vice versa)
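
A five number summary can be computed with a short Python sketch like the one below. The heights are made-up data, the function name five_number_summary is hypothetical, and Q1 and Q3 are taken as the medians of the lower and upper halves (other quartile rules can give slightly different values).

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with Q1/Q3 as medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]

# Hypothetical tree heights in feet (made-up data for illustration)
heights = [20, 95, 110, 118, 122, 125, 128, 131, 140]

summary = five_number_summary(heights)
minimum, q1, median, q3, maximum = summary

print("five number summary:", summary)
# Whisker lengths show how far the box plot stretches on each side
print("left whisker length:", q1 - minimum)
print("right whisker length:", maximum - q3)
```

These five values are exactly the ones a box plot draws: the two whisker ends, the two box edges, and the line inside the box.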

Comparing Distributions of a Quantitative Variable 

  • Compare shape, center, and spread + interpret them 

  • BE SPECIFIC (don’t just say 35, say 35 trees)

  • Some sets of data can be modeled with a density curve [a smooth curve used to model a set of data and give insight into what the actual population the data represents could look like]

    • Ex: normal distribution curve 

  • Empirical rule: in normal distributions, 68% of the population is within 1 standard deviation of the mean, 95% of the population is within 2 standard deviations of the mean, and 99.7% is within 3 standard deviations of the mean

    • The remaining 0.3% of the data lies more than 3 standard deviations from the mean and usually isn’t a concern

  • A z score measures how many SDs above or below the mean a value is (can be negative or positive)

    • Formula for z score: Z = (x-μ)/σ

    • Allows us to compare values from different distributions on the same scale

    • P(z [<, >, or =] z score); ex: P(z<1.11), P(z>1.11), P(-0.56 < z < 1.11), P(z=1.11)

    • CALC FUNCTION FOR Z SCORES: 2nd → vars → normalcdf

      • Lower value: either the z score or -99 (when there is no lower bound)

      • Upper value: either the z score or 99 (when there is no upper bound)

      • μ: 0

      • σ: 1

    • If given an area (probability) instead, you can find the z score that goes with it through the calc function invNorm

      • area: the area to the left of the z score, as a decimal

      • To get the data value that a z score represents, plug the known numbers into the z score formula and solve for x
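
For anyone without the calculator handy, the same z score work can be sketched with Python's standard library. Here NormalDist.cdf plays the role of normalcdf (it gives the area to the left of a value, so no -99 or 99 bound is needed) and NormalDist.inv_cdf plays the role of invNorm. The example value 130 with mean 100 and SD 15 is made up, and the function name z_score is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical example: a value of 130 when the mean is 100 and the SD is 15
z = z_score(130, 100, 15)

# Areas under the standard normal curve
p_below = standard_normal.cdf(1.11)                                  # P(z < 1.11)
p_above = 1 - standard_normal.cdf(1.11)                              # P(z > 1.11)
p_between = standard_normal.cdf(1.11) - standard_normal.cdf(-0.56)   # P(-0.56 < z < 1.11)

# Going backwards: the z score with 90% of the area to its left
z_for_area = standard_normal.inv_cdf(0.90)

# Checking the empirical rule: area within 1, 2, and 3 SDs of the mean
within = [standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)]

print("z score:", z)
print("P(z < 1.11):", round(p_below, 4))
print("P(z > 1.11):", round(p_above, 4))
print("P(-0.56 < z < 1.11):", round(p_between, 4))
print("z with area 0.90 to the left:", round(z_for_area, 4))
print("within 1, 2, 3 SDs:", [round(p, 4) for p in within])
```

Note that P(z = 1.11) is 0 for a continuous distribution like the normal curve, so "less than" and "less than or equal to" give the same area.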