Review for Exam 1 - do review quizzes again

General: Describe the information

  • Population, sample, cases

  • The basic information we start with is:

    WHAT is this data all about?

    WHERE did it come from?

    HOW is it organized?

    Is there a QUESTION being asked?

    From this QUESTION - What do we know of the POPULATION of Interest?

    How was this Data Obtained? - SAMPLING METHODS

    What is the Sample comprised of? CASES

When there is x rows, there is x cases unless data is inputted incorrectly

Data - Data Types

  • The Data that we get is made up of identification information and variables.

    Variables can be NUMERIC / QUANTITATIVE or CATEGORICAL/

    QUALITATIVE. They may also be broken down into subgroups.

    NUMERIC/QUANTITATIVE data can be

    • Continuous - things that have infinite measurement potential

    • Discrete - we can't have 1/2 of a person

    CATEGORICAL/ QUALITATIVE data can be

    • Ordinal - are placed in a specific order

      • ex. Year

    • Nominal - are names of things

    • You can do most visualizations once you have a number

    • Any time there’s a comparison, there’s a side-by-side

Data - Visualization

  • We visualize our data in different ways based on what type of data it is.

    • Categorical Data - mosaic Charts , bar charts, pie charts

    • Numeric Data - histograms, box plots, dot plots, density plots

      • For a scatterplot, you need two quantitative (numeric) variables so you can see how one changes with respect to the other.

      • KNOW WHAT THESE LOOK LIKE

      • List how different variable types can apply to each visualization

Data - Summary / Descriptive Data / 5# summary / contingency tables

  • We summarize our data in different ways based on what type of data it is.

  • Categorical Data - contingency tables, frequency tables, 2x2 tables

  • Numeric Data - Numeric values that describe the center, spread, and shape

    • Mean/Median/Mode Σ(X)/N — loc N/2 + 1 —- highest frequency

    • Variation, Standard Deviation, MAD, Range, IQR

    • Skewness (le /neg, right/pos)

Numeric

  • Mean (X1+X2+X3+…Xn)/N

  • Median - location at N/2+1 or the average of the two center points

  • Mode - highest frequency

  • 5-number summary: Min, Q1, Med, Q3, Max 

Range: Max-Min

InterQuartile Range (IQR) = Q3-Q1

The formulas with greek letters have to do with a population

Outliers: Q1-1.5xIQR and Q3-1.5xIQR

Upper bounded data is a set of numerical values where there is a known, maximum limit that the data points

Calculating Standard Deviation

x = each data point

mean = the average of all the data points

(x - mean) = how far each point is from the mean

(x - mean)^2 = squared difference

sum( (x - mean)^2 ) = add up all the squared differences

n - 1 = number of data points minus 1 (degrees of freedom)

Always draw a picture when you do z-scores

Data - Relationship Plotting

We can investigate the relationship between variables by plotting them on a x/y plot where the RESPONSE variable is on the Y axis and the EXPLANATORY variable(s) is/are on the X axis

Experiment (Expt) - Hypotheses

Why are you doing this experiment in the first place?

Create a set of hypotheses around the question we are interested in answering

  • H0 : The data is independent aka Status Quo

  • Nothing will change how things are H0: μ1 = μ2

  • Ha : The data is not independent aka Something Changes

  • Ha: μ1 ≠ μ2

Expts - Types of experiments

Two main type of study:

Observational

Designed Experiment (sometimes referred to only as Experiment - this is a trap)

We will not consider anecdotal evidence as a type of study

Expt - Types of sampling methods

Simple Random: random selection of a number of cases/observational units

Stratified: First create groups within the population, and then randomly select from each group

  • groups are created, not naturally occurring

    • ex. a lab professor believes the section a student is in might affect how they feel about the course

Cluster: break up the population into multiple naturally occurring groups (clusters), and select some of the clusters, including a data from EVERYONE in the cluster

  • ex. groups of college students who live in dorms by year (freshman, sophomore, etc.) if you want to equally represent students from all years

Multistage sampling: break up the population into multiple groups (clusters), and select some of the clusters, including data from SOME members of the cluster

Expts - X/Y aka Ind/Dep aka Expl/Response

Based on the Question we asked, we de ne our Response Variable - or Y - or Dependent (because it is dependent on the changes of we make to our explanatory variables

USUALLY the Y or RESPONSE is depicted on the Y axis of our visualization

USUALLY the EXPLANATORY variable(s) are on the X axis of our visualization