2. Descriptive Statistics

Variable = numerical data collected on a theoretical concept of interest

Categorical variables (measured in groups/categories)

  1. Nominal - values represent categories that have no order (e.g. eye color, ice cream flavor)

  2. Ordinal - values have an order/hierarchy, but we don’t know how big the difference between them is (e.g. shirt sizes)

Continuous variables (measured on a continuous scale)

  1. Interval - equal differences in values = equal differences in the property, but 0 does not indicate absence of the property (e.g. IQ test - 0 does not mean no intelligence)

  2. Ratio - like interval, but has an absolute 0 and measurements can be compared by calculating ratios (e.g. height)

Describing data
  • graphs

  • numerical values describing specific features - descriptive stats

  1. Bar charts - nominal level measurement

Proportion/percentages

  • absolute frequency (count) = no. of times a value is observed

  • relative frequency (percentage) = no. of times a value is observed, as a percentage of the total number of observations

  • valid frequency = no. of times a value is observed, as a percentage of only the observations with a valid (non-missing) answer

  • cumulative frequency = percentage of a group plus the percentages of all previous groups (a running total)
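The four frequencies above can be sketched in Python. The shirt-size answers, the `None` marker for a missing answer, and the `frequency_table` helper are all made up for illustration. Accumulating the valid percentages (so the last group ends at 100%) is an assumption; the notes do not say which percentage the running total uses.

```python
from collections import Counter

# Hypothetical ordinal data; None marks a missing (invalid) answer.
ORDER = ["S", "M", "L"]
answers = ["M", "S", "M", None, "L", "M", "S", None]

def frequency_table(values, order):
    """Return rows of (category, count, relative %, valid %, cumulative %)."""
    total = len(values)                                   # all observations
    counts = Counter(v for v in values if v is not None)  # absolute frequencies
    n_valid = sum(counts.values())                        # valid answers only
    rows = []
    cumulative = 0.0
    for category in order:
        count = counts[category]
        relative = 100 * count / total     # relative to all observations
        valid = 100 * count / n_valid      # relative to valid answers
        cumulative += valid                # running total (assumed: valid %)
        rows.append((category, count, relative, valid, cumulative))
    return rows

for row in frequency_table(answers, ORDER):
    print("%-2s count=%d relative=%.1f%% valid=%.1f%% cumulative=%.1f%%" % row)
```

With 2 of the 8 answers missing, "M" has a count of 3, a relative frequency of 37.5% and a valid frequency of 50%.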

Mode = the value that occurs the most (the only measure of center that works for nominal data)

  1. Histogram - interval/ratio

Mean = sum of values divided by no. of values (only for interval/ratio data)

Measure of center = the point around which most data is concentrated (e.g. mean, mode, median)

Spread = how much data values differ from each other, and how much they differ from the measure of center (big spread = data varies more)

Measure of spread: range, mean absolute deviation, variance, standard deviation
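A minimal sketch of the two simplest measures in this list, range and mean absolute deviation. The `scores` list and the function names are made up for illustration, and the mean absolute deviation is assumed to be taken around the mean.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

scores = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical data, mean = 5
print(value_range(scores))               # 9 - 2 = 7
print(mean_absolute_deviation(scores))   # 12 / 8 = 1.5
```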

Variance = the mean squared deviation of values from the mean (flaw: the unit of variance is the squared unit of the variable, so it’s harder to interpret) (small variance = small spread)

Standard deviation (sigma σ) = square root of variance = measure of spread around the mean (perk: same unit as the variable; because deviations are squared, extreme values get more emphasis)

  1. compute the Sum of Squares (SS): calculate the mean, for every score subtract the mean (these are deviations), square all the deviations, add all these squared deviations to get the Sum of Squares

  2. find variance: σ² = SS divided by n

  3. find root to get st. dev.: σ = square root of variance
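The three steps can be followed literally in Python. This is a sketch with made-up `scores` and function names. It divides by n as in the notes; the sample version of the variance divides by n - 1 instead.

```python
import math

def sum_of_squares(values):
    """Step 1: sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    deviations = [x - mean for x in values]
    return sum(d ** 2 for d in deviations)

def variance(values):
    """Step 2: SS divided by n."""
    return sum_of_squares(values) / len(values)

def std_dev(values):
    """Step 3: square root of the variance."""
    return math.sqrt(variance(values))

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5
print(sum_of_squares(scores))       # 32.0
print(variance(scores))             # 4.0
print(std_dev(scores))              # 2.0
```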

  1. Box plot - interval/ratio

Median = middle value of the ordered data: order values low to high and count them (n) // if n is odd, divide n by 2, round up, and count from the beginning to that position // if n is even, take the mean of the middle 2 values
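The median rule, including the even case, as a short Python sketch (the function name and example lists are made up for illustration):

```python
def median(values):
    """Middle value of the sorted data; mean of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2                  # 0-based index of the (upper) middle value
    if n % 2 == 1:
        return ordered[middle]       # odd n: the single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2   # even n

print(median([7, 1, 3]))       # ordered: 1, 3, 7 -> 3
print(median([5, 1, 7, 3]))    # ordered: 1, 3, 5, 7 -> (3 + 5) / 2 = 4.0
```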

Range = maximum value - minimum value

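Quartiles - order values, find the median (Q2), which splits the data into a lower and an upper half, then find the medians of those halves (Q1, Q3) // if a half has an even no. of values, its median is the mean of the middle 2 values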

Interquartile range (IQR) = Q3 - Q1 = the range covered by the middle 50% of the data
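A sketch of quartiles and IQR using the "median of the halves" approach. One assumption is labeled in the code: when the number of values is odd, the median itself is left out of both halves. The notes do not specify this, and software packages use several different quartile conventions, so other tools may give slightly different Q1 and Q3. The data and function names are made up.

```python
def median(values):
    """Middle value of the sorted data; mean of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the median is excluded from both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # hypothetical data, n = 9
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)                       # 3.5 7 11.0
print(q3 - q1)                          # IQR = 7.5
```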

Outlier = a value far away from the rest of the data - can heavily influence the mean and standard dev., while the median changes little or not at all
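A small illustration with Python's standard `statistics` module. The two lists are made up; they differ only in the largest value, which is replaced by an extreme one. In this example the median does not move, while the mean and standard deviation jump.

```python
import statistics

clean = [2, 3, 4, 5, 6]           # hypothetical data
with_outlier = [2, 3, 4, 5, 60]   # same data, largest value replaced by an outlier

def summarize(values):
    """Mean, median and (population) standard deviation of the values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),   # divides by n, as in the notes
    }

before = summarize(clean)
after = summarize(with_outlier)
print(before["mean"], after["mean"])       # 4 vs 14.8
print(before["median"], after["median"])   # 4 vs 4
print(before["sd"] < after["sd"])          # True
```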

  1. Nested bar chart - for nominal data

Proportions & conditional proportions

  1. Side-by-side boxplots - nominal + interval/ratio

Median & IQR

  1. Scatter plot - interval/ratio

Correlation coefficient (Pearson's r) - how closely points on scatterplot resemble a straight line
  • ranges over [-1, +1] - sign shows direction of the linear relationship, magnitude shows strength (closer to -1 or +1 = stronger relationship, closer to 0 = weaker)

  • !! no correlation =/= no relation (r only measures linear relationships) + careful w. outliers (a single one can change r a lot)
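A sketch of Pearson's r computed from deviations around the means. The data are made up to show both points above: a perfect straight line gives r = 1, while a perfect U-shape gives r = 0 even though y depends completely on x.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (co-variation of x and y).
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squares of x and y.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

xs = [-2, -1, 0, 1, 2]
line = [2 * x + 1 for x in xs]   # points exactly on an increasing line
curve = [x ** 2 for x in xs]     # points exactly on a U-shaped curve

print(pearson_r(xs, line))    # 1.0
print(pearson_r(xs, curve))   # 0.0
```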
