PPDAC
problem, plan, data, analysis, conclusion
Experimental Journey
Observation, question, background research, identify variables, hypothesis, experimental design, predictions
Steps to test hypothesis
1. Import data, 2. Tidy the data, 3. Look at the data, 4. Test hypotheses
Fundamental parts of coding
object, function, new object (input, process, output)
Objects in R
an object is anything that stores data in R, which you can assign a name to
Object Rules
allowed characters: letters, numbers, . and _
Must start with a letter
No spaces
case sensitive
assigned with <- (the assignment arrow)
Function
code that commands an operation and gives an output
Function Rules
uses parentheses
data goes inside parentheses
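A minimal sketch of the object and function rules above. The object names and height values are made up for illustration.

```r
# Assign a vector of (made-up) heights to a named object with <-
# The name starts with a letter and uses only letters and _
plant_heights <- c(12.5, 14.1, 13.3)

# Call a function: the data goes inside the parentheses,
# and the output is stored in a new object
mean_height <- mean(plant_heights)

# Names are case sensitive: Plant_Heights would be a different object
print(mean_height)
```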
Continuous data
can take any value within a range, infinite possible values (height, weight, time)
Count
whole numbers only, represents how many (number of students)
Categorical
groups or labels with no inherent order (eye color, species, blood type)
Binomial
two possible outcomes (yes/no)
ordinal
ordered categories where the distance between levels is not equal or known, so they are not numerical distances (rankings, class level, pain scale)
tidy data
each variable in its own column, each observation in its own row, each value is in one cell
Bar Plot
relationship between 1 continuous variable and 1 or more categorical variables
Scatterplot
relationship between 2 continuous variables
line graph
relationship between 2 continuous variables, 1 variable is ordered (usually time)
Histogram
distribution of 1 continuous variable
Box and violin plots
visualize the distribution of a continuous variable for one or more categories
descriptive statistics
a set of summary measurements that simply communicate important information like centrality and variation
data type: non numerical
proportions, percentages, ratios
data type: numerical
mean, median, mode
Mean
the average
Median
the middle of all ranked values (most robust)
Mode
the most common value
Robust
a summary measure that is resilient to single extreme values
residuals
observation minus the predicted value (here, the mean)
range
the difference between the largest and the smallest value; range() returns the smallest and largest values, so diff(range()) gives the difference
interquartile range
the range of the middle 50% of data, shown by the box in a box plot, IQR()
variance
measures how far data points are from the mean on average (squared distance), var()
standard deviation
the square root of variance, average distance from the mean, sd()
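The measures of centrality and variation above can be sketched in base R. The data vector is made up, and the mode line is one common workaround, since base R has no built-in function for the statistical mode.

```r
# Made-up sample of 7 values
x <- c(2, 4, 4, 5, 7, 9, 11)

# Centrality
x_mean <- mean(x)      # 6
x_median <- median(x)  # 5
# Most common value: tabulate, take the biggest count, convert its label back
x_mode <- as.numeric(names(which.max(table(x))))  # 4

# Variation
x_range <- diff(range(x))  # range() gives c(min, max); diff() gives max - min
x_iqr <- IQR(x)            # spread of the middle 50% of the data
x_var <- var(x)            # sample variance (divides by n - 1)
x_sd <- sd(x)              # square root of the variance
```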
Sample Mean
calculated from data, to estimate true mean
true mean
actual average of the entire population, usually unknown
Sample Standard Deviation
how spread out sample data is
True standard deviation
uses N (full population), usually unknown
Uncertainty about whether a sample is accurate
standard error, confidence intervals, p-values, statistical power
normal distribution
bell curve; normal distributions are everywhere
Central Limit Theorem
assume you sample a population many times independently and each time you calculate a sample mean; the distribution of those means will be approximately normally distributed (given a large enough sample size)
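A small simulation can illustrate the theorem. This is a sketch under assumptions: the population is taken to be exponential (clearly not bell-shaped) with a true mean of 1, and the sample size of 50 and the 2000 repeats are arbitrary choices.

```r
set.seed(1)  # make the random draws repeatable

# Draw 2000 independent samples of 50 values each from a skewed
# (exponential) population and keep the mean of each sample
sample_means <- replicate(2000, mean(rexp(50, rate = 1)))

# The sample means cluster around the true mean (1), and their spread
# is close to the population sd divided by sqrt(50)
mean(sample_means)
sd(sample_means)

# hist(sample_means) would show a roughly bell-shaped histogram
```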
standard error
how much the sample mean is expected to vary from the true population mean (how accurate sample mean is)
Confidence intervals
a range of values used to estimate the true population parameter (where the true mean is likely to be)
A 95% confidence level means that, across repeated samples, about 95% of the intervals would contain the true mean
as sample size increases, the CI gets more precise (narrower)
as variability increases, the CI becomes more uncertain (wider)
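A sketch of how the standard error and a 95% confidence interval might be computed by hand. The data are made up, and the interval assumes the usual t-based formula (mean plus or minus a t critical value times the standard error), which the cards do not spell out.

```r
# Made-up sample of 8 measurements
x <- c(4.1, 5.0, 5.6, 4.8, 5.3, 4.9, 5.2, 5.1)
n <- length(x)

# Standard error: sample standard deviation divided by sqrt(n)
se <- sd(x) / sqrt(n)

# 95% CI: sample mean +/- t critical value * standard error
t_crit <- qt(0.975, df = n - 1)
ci <- mean(x) + c(-1, 1) * t_crit * se
ci

# Larger n or smaller sd(x) shrinks se, which narrows the interval
```

The same interval is reported by `t.test(x)$conf.int`.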
independent variables
the scientist changes this factor on purpose to see what happens
x axis
dependent variable
measured to see if the IV made a difference
y axis
what changes
hypothesis
testable and falsifiable statement that explains a possible relationship between 2 or more variables based on existing knowledge
null hypothesis
the assumption that there is no effect, no differences, or no relationship
Parametric tests
t-test, ANOVA, chi-squared
Homoscedasticity
equal variance
heteroscedasticity
unequal variance
transforming data
change all values of a variable in an identical way mathematically, not changing the relationship between values
ex) square root, natural log
back transformation
transforming data and then undoing it using the reverse transformation
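A minimal sketch of a natural log transformation and its back transformation, using made-up values.

```r
# Made-up, strongly right-skewed values
x <- c(1, 10, 100, 1000)

# Transform: apply the same operation to every value
log_x <- log(x)  # natural log

# The order of the values is unchanged by the transformation
identical(order(log_x), order(x))

# Back transform: undo the log with its reverse, exp()
back <- exp(log_x)

# A mean calculated on the log scale can be back-transformed too
back_transformed_mean <- exp(mean(log_x))
```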
normality test
statistical method used to determine whether a dataset follows a normal distribution
Kolmogorov Smirnov test
generates an ideal distribution using parameters drawn from our data, and we then compare this to our data
outliers
data that exists outside of the typical distribution of data
impossible values, or values more than 1.5 times the interquartile range beyond the quartiles
trimming
removes outliers
when outliers are extreme or impossible
winsorization
replaces outliers with less extreme values, such as the 5th and 95th percentiles
when extreme outliers still hold biological significance
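The outlier rule, trimming, and winsorization can be sketched as follows. The data are made up, and the fences assume the common convention of 1.5 times the IQR below the first quartile or above the third quartile.

```r
# Made-up data with one extreme value
x <- c(10, 12, 11, 13, 12, 11, 14, 95)

# Flag values more than 1.5 * IQR beyond the quartiles
q <- quantile(x, c(0.25, 0.75))
fence_low <- q[[1]] - 1.5 * IQR(x)
fence_high <- q[[2]] + 1.5 * IQR(x)
is_outlier <- x < fence_low | x > fence_high

# Trimming: remove the flagged values
trimmed <- x[!is_outlier]

# Winsorization: pull extremes in to the 5th and 95th percentiles
limits <- quantile(x, c(0.05, 0.95))
winsorized <- pmin(pmax(x, limits[[1]]), limits[[2]])
```

Trimming shortens the data set, while winsorizing keeps every observation but caps how extreme it can be.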
Confusion matrix
a table that compares reality to what your data set or test concludes
True positive
effect is real and the effect is detected
false positive
data shows effect, no effect in reality
false negative
no effect in data but effect is real
true negative
no effect in reality and no effect in data
test statistics
any calculated value that measures the difference between experimental groups (control vs treatment)
larger = stronger evidence against the null
difference between groups over amount of variation
z-score
are two population means significantly different
t-score
are two sample means significantly different
F-score
are any of 3 or more sample means different
chi-squared
does the observed data match an expected distribution
p-value
assuming the null hypothesis is true, the p-value represents the probability you would have gotten your measured test statistic or something more extreme by random chance
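A sketch of getting a test statistic and p-value for two sample means with `t.test()`. The control and treatment values are made up for illustration.

```r
# Made-up measurements for two groups
control <- c(5.1, 4.9, 5.6, 5.0, 5.3, 4.8)
treatment <- c(6.2, 5.9, 6.5, 6.1, 5.8, 6.4)

# Two-sample t-test (R uses the Welch version by default,
# which does not assume equal variances)
result <- t.test(treatment, control)

# t-score: difference between group means over the amount of variation
result$statistic

# p-value: probability of a t-score at least this extreme
# if the null hypothesis (no difference) were true
result$p.value
```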
false negative
type 2 error
false positive
type 1 error
Scatterplot
use when both variables are continuous
can see clustering
direction of the relationship
ex: petal length and petal width
Covariance
shows how two variables vary together
if both increase together, the covariance is positive
if one increases and one decreases, the covariance is negative
covariance of zero means no consistent relationship
Variance vs Covariance
Variance only measures the spread of a single variable around its mean, while covariance measures how two variables vary together
Correlation
measures the direction and strength of a linear relationship between two variables
r value near 1 is a strong positive relationship
r value near -1 is a strong negative relationship
r value near 0 means no linear relationship
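Covariance and correlation for the petal length and petal width example can be sketched with R's built-in `iris` data set. Using `iris` is an assumption about which data the example refers to.

```r
# Covariance: positive if the two variables increase together
petal_cov <- cov(iris$Petal.Length, iris$Petal.Width)

# Correlation: direction and strength on a fixed -1 to 1 scale
petal_r <- cor(iris$Petal.Length, iris$Petal.Width)

# Correlation is covariance rescaled by both standard deviations
rescaled <- petal_cov / (sd(iris$Petal.Length) * sd(iris$Petal.Width))

petal_cov
petal_r  # close to 1: strong positive relationship
```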
Correlation vs Regression
if you want to know if two variables are related, use correlation
if you want to predict one variable from another, use regression
Linear Regression
model that shows how a dependent variable changes as an independent variable changes
y = mx + b
y is dependent variable
x is independent
m is slope: how much the dependent variable changes for every one unit increase of the independent
b is intercept
residuals
measure the difference between the observed value and the predicted value
Residual = observed - predicted
ordinary least squares fits the line by minimizing the sum of squared residuals
squaring prevents positive and negative errors from canceling each other out
R
correlation coefficient
R squared
measures explanatory power
proportion of variation in the dependent variable that can be explained by the independent variable
the larger the R squared, the better the model is predicting
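A sketch of a linear regression, its residuals, and R squared using `lm()`. The built-in `iris` data set is an assumed example, with petal length as the independent variable and petal width as the dependent variable.

```r
# Fit y = mx + b by ordinary least squares
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# b (intercept) and m (slope)
coef(fit)

# Residual = observed - predicted
res <- residuals(fit)
manual_res <- iris$Petal.Width - fitted(fit)

# R squared: proportion of variation in y explained by x
r_squared <- summary(fit)$r.squared

# With one predictor, R squared equals the correlation coefficient squared
r <- cor(iris$Petal.Length, iris$Petal.Width)
```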
Regression Assumptions
relationship must be linear
residuals normally distributed
homoscedasticity, spread should be consistent
eliminate outliers, etc.
ANOVA
independent variable is categorical with more than two groups
dependent variable is continuous
researchers may test whether different chicken feed types produce different average chicken weights
Factor and levels
categorical independent variable
levels would be the categories within the factor
ANOVA assumptions
independence, normality (Kolmogorov-Smirnov test), equal variance
F statistic
between group variance over within group variance
F value becomes large if between group variance is larger than within group variance, and the p-value decreases
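A sketch of a one-way ANOVA for the chicken feed example, using R's built-in `chickwts` data set (chick weights for six feed types). Treating `chickwts` as the data behind the card is an assumption.

```r
# weight is the continuous dependent variable;
# feed is the factor, and its levels are the feed types
fit <- aov(weight ~ feed, data = chickwts)

# Pull the ANOVA table out of the summary
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# F = between group variance over within group variance
# (row 1 is feed, row 2 is the residuals)
manual_f <- anova_table[["Mean Sq"]][1] / anova_table[["Mean Sq"]][2]

f_value
p_value
```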
Positive control
expected to produce an effect
negative control
expected to produce no effect
correlation study
observes variables without directly changing them
ex: studying whether noise levels are associated with poor sleep in ICU
do not assign noise levels
observe existing conditions
correlation does not prove causation, may be confounding variables
Manipulative study
researchers directly manipulate the independent variable
ex: assign one group to a new exercise regimen and another group is the control
stronger evidence for causation
retrospective studies
looks backward and uses existing records
less control over variables
prospective
follow subjects forward in time
Field experiments
occur in natural environments
more realistic
less control
more confounding variables
Laboratory experiments
occur in controlled settings
easier to isolate variables
may not reflect real world conditions
In Vivo
within a living organism
In vitro
means outside a living organism, lab dish or test tube
randomized single factor
experiment with one independent variable (factor) where subjects are randomly assigned to treatment groups.
case control study
observational study that works backwards
start with people who already have an outcome and compare them to people who don't, then look back at what differed
repeated measures design
same subjects measured multiple times across different conditions or time points
each person serves as their own control
cross over design
type of repeated measures design where subjects switch between treatments in a sequence, with a wash out period, each subject experiences each treatment
quasi experiment
resembles an experiment but lacks full random assignment
doesn't get full control of who gets which treatment
when randomization is unethical or impractical
factorial design
two or more independent variables tested simultaneously, allowing you to examine main effects and interactions between factors
bootstrapping method
A resampling method that repeatedly draws samples (with replacement) from your existing data to estimate a statistic's distribution, without assuming normality
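A minimal bootstrap sketch in base R. The data are made up, the median is an arbitrary choice of statistic, and the percentile interval is one simple way to summarize the bootstrap distribution.

```r
set.seed(42)  # make the resampling repeatable

# Made-up sample of 10 observations
x <- c(3.1, 4.7, 5.2, 2.9, 6.8, 4.4, 5.9, 3.6, 4.1, 7.3)

# Resample the data with replacement 5000 times and
# recompute the statistic (here the median) each time
boot_medians <- replicate(5000, median(sample(x, replace = TRUE)))

# The spread of the bootstrap medians estimates the uncertainty;
# the middle 95% gives a percentile confidence interval
boot_ci <- quantile(boot_medians, c(0.025, 0.975))
boot_ci
```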
factorial design
A factorial design is an experimental design where researchers study the effects of two or more independent variables simultaneously.
Example:
A researcher wants to see how sleep and caffeine affect test scores.
Factor 1: Sleep
4 hours
8 hours
Factor 2: Caffeine
No caffeine
Coffee
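The treatment combinations in this example can be listed with `expand.grid()`, which crosses every level of one factor with every level of the other. The column names are made up for illustration.

```r
# Cross the two factors: 2 sleep levels x 2 caffeine levels
design <- expand.grid(
  sleep = c("4 hours", "8 hours"),
  caffeine = c("No caffeine", "Coffee")
)

design        # one row per treatment combination
nrow(design)  # 4
```

Having every combination is what lets the analysis separate the main effect of each factor from their interaction.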
Blocking
Blocking = grouping experimental subjects based on a characteristic that could affect the response variable, then randomly assigning treatments within each group.
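A sketch of random assignment within blocks. The subjects, the age-group blocking variable, and the two treatment labels are all hypothetical.

```r
set.seed(7)  # make the random assignment repeatable

# 12 hypothetical subjects grouped into 3 blocks by age group
subjects <- data.frame(
  id = 1:12,
  block = rep(c("young", "middle", "old"), each = 4)
)

# Within each block, shuffle an equal number of each treatment label
subjects$treatment <- NA_character_
for (b in unique(subjects$block)) {
  rows <- which(subjects$block == b)
  labels <- rep(c("control", "treatment"), length.out = length(rows))
  subjects$treatment[rows] <- sample(labels)
}

# Every block ends up with both treatments equally represented
assignment_counts <- table(subjects$block, subjects$treatment)
assignment_counts
```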