Review for Exam 1 - do review quizzes again

General: Describe the information

Population, sample, cases
The basic information we start with is:
WHAT is this data all about?
WHERE did it come from?
HOW is it organized?
Is there a QUESTION being asked?
From this QUESTION - What do we know of the POPULATION of Interest?
How was this Data Obtained? - SAMPLING METHODS
What is the Sample comprised of? CASES

When there are x rows, there are x cases, unless the data was entered incorrectly

Data - Data Types

The Data that we get is made up of identification information and variables.
Variables can be NUMERIC/QUANTITATIVE or CATEGORICAL/QUALITATIVE.
They may also be broken down into subgroups.
NUMERIC/QUANTITATIVE data can be
- Continuous - measurements that can take any value in a range (ex. height, time)
- Discrete - counts in whole numbers (we can't have 1/2 of a person)
CATEGORICAL/QUALITATIVE data can be
- Ordinal - categories that are placed in a specific order
  - ex. Year
- Nominal - categories that are names of things, with no natural order
- You can do most visualizations once you have a number
- Any time there's a comparison between groups, there's a side-by-side plot (ex. side-by-side box plots)

Data - Visualization

We visualize our data in different ways based on what type of data it is.
- Categorical Data - mosaic charts, bar charts, pie charts
- Numeric Data - histograms, box plots, dot plots, density plots
  - For a scatterplot, you need two quantitative (numeric) variables so you can see how one changes with respect to the other.
  - KNOW WHAT THESE LOOK LIKE
  - List how different variable types can apply to each visualization

Data - Summary / Descriptive Data / 5# summary / contingency tables

We summarize our data in different ways based on what type of data it is.
Categorical Data - contingency tables, frequency tables, 2x2 tables
Numeric Data - Numeric values that describe the center, spread, and shape
- Mean/Median/Mode: Σ(X)/N, middle value of the sorted data, highest frequency
- Variance, Standard Deviation, MAD, Range, IQR
- Skewness (left/neg, right/pos)
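
A minimal sketch of the categorical summaries listed above, using only Python's standard library. The eight cases and the variable names (year, lives on campus) are made up for illustration and are not from the course data.

```python
from collections import Counter

# Hypothetical cases: (year, lives_on_campus) for eight students
cases = [
    ("freshman", "yes"), ("freshman", "yes"), ("freshman", "no"),
    ("senior", "no"), ("senior", "no"), ("senior", "yes"),
    ("freshman", "yes"), ("senior", "no"),
]

# Frequency table: counts for ONE categorical variable
year_counts = Counter(year for year, _ in cases)

# Contingency (2x2) table: counts for each combination of TWO categorical variables
table = Counter(cases)

print("year counts:", dict(year_counts))
print("rows = year, columns = yes / no")
for year in ("freshman", "senior"):
    row = [table[(year, answer)] for answer in ("yes", "no")]
    print(year, row)
```

Each cell of the 2x2 table is a count of cases, so the cells add back up to the number of rows in the data.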

Numeric

Mean = (X1 + X2 + X3 + … + XN)/N
Median - middle value of the sorted data: location (N+1)/2 when N is odd, or the average of the two center points (locations N/2 and N/2+1) when N is even
Mode - highest frequency
5-number summary: Min, Q1, Med, Q3, Max
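
A quick sketch of these summaries with Python's built-in statistics module. The data values are made up, and the quartile method ("inclusive") is an assumption: textbooks and software use slightly different quartile rules, so Q1 and Q3 may differ a little from a hand calculation.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # made-up sample, N = 7

mean_value = statistics.mean(data)      # sum of the values divided by N
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # value with the highest frequency

# Quartiles: the cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# 5-number summary: Min, Q1, Med, Q3, Max
five_number = (min(data), q1, median_value, q3, max(data))

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)
print("5-number summary:", five_number)
```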

Range: Max-Min

InterQuartile Range (IQR) = Q3-Q1

The formulas with Greek letters (ex. μ, σ) have to do with a population; sample statistics use Latin letters (ex. x̄, s)

Outliers: any point below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
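
A small sketch of range, IQR, and the usual 1.5 x IQR outlier rule (lower fence Q1 - 1.5 x IQR, upper fence Q3 + 1.5 x IQR). The data is made up, and the "inclusive" quartile method is an assumption; other quartile rules can shift the fences slightly.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 30]  # made-up values; 30 looks suspicious

data_range = max(data) - min(data)  # Range = Max - Min

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # IQR = Q3 - Q1

# Fences: anything outside them is flagged as a possible outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("range:", data_range, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```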

Upper bounded data is a set of numerical values where there is a known maximum limit that the data points cannot exceed

Calculating Standard Deviation

x = each data point

mean = the average of all the data points

(x - mean) = how far each point is from the mean

(x - mean)^2 = squared difference

sum( (x - mean)^2 ) = add up all the squared differences

n - 1 = number of data points minus 1 (degrees of freedom)

standard deviation = sqrt( sum( (x - mean)^2 ) / (n - 1) )
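
The pieces above can be put together one step at a time, finishing by dividing by n - 1 and taking the square root. This is a sketch on made-up data for the sample standard deviation, checked against the standard library's statistics.stdev.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # made-up sample

n = len(data)
mean = sum(data) / n                           # average of all the data points
deviations = [x - mean for x in data]          # how far each point is from the mean
squared = [d ** 2 for d in deviations]         # squared differences
sum_squared = sum(squared)                     # add up all the squared differences
variance = sum_squared / (n - 1)               # divide by n - 1 (degrees of freedom)
sample_sd = math.sqrt(variance)                # square root gets back to the original units

print("sum of squares:", sum_squared)
print("variance:", variance)
print("standard deviation:", round(sample_sd, 4))
```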

Always draw a picture when you do z-scores
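
The z-score formula is not written out in these notes, so this sketch assumes the standard definition z = (x - mean) / standard deviation. The function name and the exam numbers are made up for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x sits above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exam scores: mean 70, standard deviation 10
z_high = z_score(85, 70, 10)  # 15 points above the mean
z_low = z_score(60, 70, 10)   # 10 points below the mean

print(z_high, z_low)
```

On the picture, a positive z-score sits to the right of the mean and a negative z-score sits to the left.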

Data - Relationship Plotting

We can investigate the relationship between variables by plotting them on an x/y plot, where the RESPONSE variable is on the Y axis and the EXPLANATORY variable(s) is/are on the X axis

Experiment (Expt) - Hypotheses

Why are you doing this experiment in the first place?

Create a set of hypotheses around the question we are interested in answering

H0: The data is independent, aka Status Quo (nothing will change how things are)
H0: μ1 = μ2
Ha: The data is not independent, aka Something Changes
Ha: μ1 ≠ μ2

Expts - Types of experiments

Two main types of study:

Observational

Designed Experiment (sometimes referred to only as Experiment - this is a trap)

We will not consider anecdotal evidence as a type of study

Expt - Types of sampling methods

Simple Random: random selection of a number of cases/observational units

Stratified: First create groups within the population, and then randomly select from each group

groups are created, not naturally occurring
- ex. a lab professor believes the section a student is in might affect how they feel about the course

Cluster: break up the population into multiple naturally occurring groups (clusters), select some of the clusters, and include data from EVERYONE in each selected cluster

ex. groups of college students who live in dorms by year (freshman, sophomore, etc.) if you want to equally represent students from all years

Multistage sampling: break up the population into multiple groups (clusters), select some of the clusters, and include data from SOME members of each selected cluster
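
A rough sketch of the four sampling methods side by side, using Python's random module. The population (30 students in three lab sections A, B, C), the sample sizes, and the helper name group_by_section are all made up for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30 students, 10 in each of sections A, B, C
population = [(f"{section}{i}", section) for section in "ABC" for i in range(10)]

def group_by_section(units):
    """Collect the (student_id, section) cases into one list per section."""
    groups = {}
    for unit in units:
        groups.setdefault(unit[1], []).append(unit)
    return groups

groups = group_by_section(population)

# Simple random: pick cases at random from the whole population
simple = random.sample(population, 6)

# Stratified: randomly select from EVERY group
stratified = [u for members in groups.values() for u in random.sample(members, 2)]

# Cluster: select SOME groups, then keep EVERYONE in them
chosen = random.sample(sorted(groups), 2)
cluster = [u for name in chosen for u in groups[name]]

# Multistage: select SOME groups, then randomly select SOME members of each
multistage = [u for name in chosen for u in random.sample(groups[name], 3)]

print("simple:", simple)
print("stratified:", stratified)
print("cluster sections:", chosen, "size:", len(cluster))
print("multistage:", multistage)
```

The key contrast: stratified touches every group, while cluster and multistage only touch the groups that were selected.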

Expts - X/Y aka Ind/Dep aka Expl/Response

Based on the Question we asked, we define our Response Variable - or Y - or Dependent (because it is dependent on the changes we make to our explanatory variables)

USUALLY the Y or RESPONSE is depicted on the Y axis of our visualization

USUALLY the EXPLANATORY variable(s) are on the X axis of our visualization