sqrt(x)
square root
abs(x)
absolute value of x
length(x)
the number of elements in x
min(x)
the lowest value of the numbers
sum(x)
the sum of the numbers
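A quick check of the basic functions above, using a small made-up vector (the values in x are an assumption for illustration only):

```r
# Hypothetical data, chosen only to show each function
x <- c(-3, 4, 9, 16)

root  <- sqrt(16)   # 4
abs_x <- abs(x)     # 3 4 9 16
n     <- length(x)  # 4
low   <- min(x)     # -3
total <- sum(x)     # 26
```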
str(dataset_name)
structure of the data: shows the number of rows and columns, the variable names, and R's classification of each variable
head(dataset_name, n)
shows the first n rows of the data set (shows 6 if n is not specified)
tail(dataset_name, n)
shows the last n rows of the data set (shows 6 if n is not specified)
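A minimal sketch of str(), head() and tail() on the built-in iris data set (picking iris here is an assumption; any data frame works the same way):

```r
# iris ships with R, so no package is needed
str(iris)                       # prints rows, columns, names and classes

first_rows <- head(iris)        # n not specified, so 6 rows
last_rows  <- tail(iris, 3)     # last 3 rows

nrow(first_rows)                # 6
nrow(last_rows)                 # 3
```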
ggplot()
starts a plot (graphical summary) from a data set and aes() mapping; layers such as geom_histogram() are added with +
library(tidyverse)
loads the tidyverse packages (including ggplot2 and dplyr)
geom_histogram()
histogram graph
geom_boxplot()
boxplot graph
ggplot(iris, aes(x = Petal.Width, fill = Species)) + geom_histogram() + labs(title = "Sliced histogram of Petal Width by Species")
Iris histogram example
mean(x)
mean
median(x)
median
sd(x)
sample standard deviation
library(rafalib), popsd(dataset_name$variable)
population standard deviation
fivenum(x)
minimum, Q1, median, Q3, maximum
IQR(x)
interquartile range (Q3 - Q1)
LT = Q1 - 1.5 * iqr_value
lower threshold for outliers: values below LT are flagged as outliers
UT = Q3 + 1.5 * iqr_value
upper threshold for outliers: values above UT are flagged as outliers
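The two threshold cards can be put together as a short worked example. This is a sketch on made-up data; x, q1 and q3 are hypothetical names, and quantile() is used here to get Q1 and Q3.

```r
# Hypothetical data with one unusually large value
x <- c(2, 4, 5, 6, 7, 8, 30)

q1 <- unname(quantile(x, 0.25))   # 4.5
q3 <- unname(quantile(x, 0.75))   # 7.5
iqr_value <- IQR(x)               # 3

LT <- q1 - 1.5 * iqr_value        # 0
UT <- q3 + 1.5 * iqr_value        # 12

# Values outside [LT, UT] are flagged as outliers
outliers <- x[x < LT | x > UT]    # 30
```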
as.factor(), e.g. mtcars$vs = as.factor(mtcars$vs)
class(mtcars$vs)
converting a numeric variable to a factor; class() then confirms the new classification
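A small sketch of the conversion on a copy of mtcars (the copy, named cars here, is an assumption made so the built-in data set is left untouched):

```r
cars <- mtcars                    # work on a copy

before <- class(cars$vs)          # "numeric"
cars$vs <- as.factor(cars$vs)     # convert 0/1 codes to a factor
after <- class(cars$vs)           # "factor"

levels(cars$vs)                   # "0" "1"
```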
char_list = c("1", "2", "3", "4")
character vector (the values are stored as text, not as numbers)
?dataset_name
help page for the data set
class(dataset_name$variable)
shows how R has classified a certain variable
pnorm(q=…, mean=…, sd=…, lower.tail=…)
the Normal cumulative distribution function (cdf): gives the probability (area under the curve) to the left of q. Defaults are mean = 0, sd = 1, lower.tail = TRUE; lower.tail = FALSE gives the area to the right of q instead
round(number, n)
round an answer to n decimal places
qnorm(p=…, mean=…, sd=…, lower.tail=…)
gives the value (quantile) below which a proportion p of a Normal distribution falls; it is the inverse of pnorm(). Defaults are mean = 0, sd = 1, lower.tail = TRUE
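A short sketch of pnorm() and qnorm() working as inverses of each other. The mean of 100 and sd of 15 are made-up numbers for illustration.

```r
# Area to the left of 115 for a Normal(mean = 100, sd = 15)
p_left <- pnorm(115, mean = 100, sd = 15)                       # about 0.8413

# Area to the right of 115 instead
p_right <- pnorm(115, mean = 100, sd = 15, lower.tail = FALSE)  # about 0.1587

# Value below which 97.5% of a standard Normal falls
z <- qnorm(0.975)                                               # about 1.96

# qnorm() undoes pnorm(): this returns 115 again
back <- qnorm(p_left, mean = 100, sd = 15)

round(p_left, 4)                                                # 0.8413
```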
filter()
keeps only the rows of the data that satisfy a condition
mutate()
add new columns or modify existing ones
cor(x,y)
finds the correlation coefficient between x and y
lm(y ~ x, data=df)
creates a linear regression model where y is the dependent variable, x is the independent variable and df is the data.frame
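A minimal sketch of cor() and lm() together, assuming the built-in mtcars data with mpg as the dependent variable and wt as the independent variable (that choice of variables is an illustration, not part of the cards):

```r
# Correlation between weight and fuel economy
r <- cor(mtcars$wt, mtcars$mpg)

# Linear regression of mpg on wt
model <- lm(mpg ~ wt, data = mtcars)

coefs <- coef(model)        # intercept and slope
slope <- coefs[["wt"]]      # negative: heavier cars get fewer mpg

# Fitted values and residuals, the ingredients of a residual plot
fit <- fitted(model)
res <- resid(model)
```

For a regression with a single x, the square of r equals the R-squared reported by summary(model).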
geom_smooth(method="lm", se = T/F, color=…)
adds a regression line to a scatter plot; method = "lm" asks for a linear regression line in particular, and se controls whether the standard error band is drawn around it (always set se = FALSE)
geom_point()
adds points to a plot (a scatter plot); with x = .fitted and y = .resid it makes a residual plot
geom_hline(yintercept=0)
creates a horizontal line at 0
ggplot(model, aes(x=.fitted, y=.resid)) + geom_point() + geom_hline(yintercept=0, linetype= "dotted", color="red")
residual plot: residuals against fitted values, with a dotted red reference line at 0
sample(x, size, replace = T/F, prob=)
models random events: x is the data you sample from, size is the number of draws, replace says whether you sample with (TRUE) or without (FALSE) replacement, and prob (optional) is a vector of probabilities, one for each element of x
replicate(n, expr)
repeats an expression n times and collects the results, where n is the number of repeats
set.seed(x)
makes results reproducible where x is any number (question will tell you what to use)
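The three cards above are usually used together. A sketch, assuming a six-sided die and a seed of 1 (both made up for illustration):

```r
die <- 1:6                   # hypothetical box: one ticket per face

set.seed(1)                  # fix the seed so the draws can be reproduced
rolls <- sample(die, size = 10, replace = TRUE)

set.seed(1)                  # same seed again gives the same draws
rolls_again <- sample(die, size = 10, replace = TRUE)
identical(rolls, rolls_again)   # TRUE

# Repeat an experiment 1000 times: the total of two dice
totals <- replicate(1000, sum(sample(die, size = 2, replace = TRUE)))
mean(totals)                 # should land close to 7
```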
dbinom(x,n,p)
calculates P(X=x) where x is the number of events we want, n is the number of trials and p is the probability of success
pbinom(x,n,p)
combines many dbinom() to calculate P(X is less than or equal to x) where x is the number of events we want, n is the number of trials and p is the probability of success
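A worked check that pbinom() is the running total of dbinom(), assuming 10 trials with success probability 0.5 (numbers chosen only for illustration):

```r
n <- 10
p <- 0.5

exactly_3 <- dbinom(3, n, p)          # P(X = 3)  = 120/1024 = 0.1171875
at_most_3 <- pbinom(3, n, p)          # P(X <= 3) = 176/1024 = 0.171875

# Adding P(X = 0), ..., P(X = 3) gives the same answer as pbinom()
by_hand <- sum(dbinom(0:3, n, p))
```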
Box=c(1,0)
to define your box
set.seed(1)
box = c(1,0)
sum(sample(box, 10000, replace=TRUE))
to simulate 10000 tosses and add up the results
cumsum()
calculates the cumulative sum
rep(a,b)
to create large boxes where a is the number to be repeated and b is how many times it is repeated
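A sketch of rep() and cumsum() on a made-up box with three 1s and seven 0s (the counts are an assumption for illustration):

```r
# Build a larger box without typing every ticket
box <- c(rep(1, 3), rep(0, 7))

box_size <- length(box)      # 10
ones     <- sum(box)         # 3

# Running total after each draw in a fixed sequence
running <- cumsum(c(1, 0, 1, 1))   # 1 1 2 3
```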