R and Shiny - Own Notes
1.2 - Objects
R is an object-oriented language, meaning everything in R is an object, e.g:
numbers
datasets
regression results
plots
vectors
What is a Class?
Every object has a class [type].
numbers → numeric
text → character
TRUE/FALSE → logical
dataset → data frame
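To check the class of any object, you can use class(). A small sketch with made-up values (the people data frame is hypothetical, only there for illustration):

```r
# class() reports the class of any object
class(3.14)      # "numeric"
class("hello")   # "character"
class(TRUE)      # "logical"

# A tiny made-up data frame, only for illustration
people <- data.frame(Name = c("Bob", "Jon"), Age = c(24, 39))
class(people)    # "data.frame"
```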
What is a Method?
A method is something you do to an object. E.g.
plot()
plot(data)
plot(regression)
This same command will work on many kinds of object.
The 2 data structures we will mainly use are vectors and data frames.
Vectors
A vector is simply a sequence of values of the same type. A numeric vector would be (2, 3, 5, 2, 4). A character vector would be (dog, cat, bird, snake). A logical vector would be (TRUE, FALSE, FALSE, TRUE).
An important rule in R → A vector must contain the same type of data.
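What happens if you try to mix types? As a rough sketch: R does not give an error, it quietly converts everything to one common type. The values below are made up.

```r
# Mixing numbers, text and logicals: everything becomes character
mixed <- c(1, "dog", TRUE)
mixed          # "1" "dog" "TRUE"
class(mixed)   # "character"

# Mixing numbers and logicals: everything becomes numeric (TRUE = 1, FALSE = 0)
nums <- c(2, TRUE, FALSE)
nums           # 2 1 0
class(nums)    # "numeric"
```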
Data Frames
A data frame is a table. It is made of vectors combined together. E.g.
Name | Age | Smoker |
Bob | 24 | TRUE |
Jon | 39 | FALSE |
Each column is a vector.
Name = vector
Age = vector
Smoker = vector
Combine them → data frame
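The table above can be rebuilt by hand as a sketch. The object name people is my own choice; the values are the ones from the table.

```r
# Three vectors of the same length, one per column
Name   <- c("Bob", "Jon")
Age    <- c(24, 39)
Smoker <- c(TRUE, FALSE)

# Combine them into a data frame
people <- data.frame(Name, Age, Smoker)
people

# Each column can be pulled back out as a vector
people$Age   # 24 39
```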
1.3 - Importing Data
Normally, you import data from files. E.g.
JJJ <- read.csv(file="JJJ.csv", head=TRUE)
To see the dataset:
JJJ
1.4 - Exporting Data
To save an R dataset as a file:
write.csv(first_data_set,"~/Desktop/first_data_set.csv")
Chapter 2
2.1 - Procedures
Most operations in R are functions applied to vectors.
From Ch.1:
vectors = columns
data frames = tables made of vectors
So, when you run a statistical procedure, R is usually doing something like function(vector) or function(column_of_dataset). E.g.
mean(MMM$Age)
→ This calculates the mean of the Age column.
2.2 - Utility Procedures
These are basic commands used to inspect a dataset. Before doing analysis, you often want to know:
what variables exist
what the dataset looks like
how many rows it has
what type each variable is
First, the dataset is imported:
MMM <- read.csv(file="http://ssa.cf.ac.uk/MAT514/Data/MMM.csv",head=TRUE)
→ This tells us there are 13 observations of 8 variables.
To see column names of a dataset:
names(MMM)
Output:
[1] "Name" "Age"
[3] "Sex" "Height.in.Metres"
[5] "Weight.in.Kg" "Home.Postcode"
[7] "Savings.in.Pounds" "Random.Number"
This answers only one question: what variables exist in the dataset?
We can also do:
head() → prints first 6 rows
tail() → prints last 6 rows
str() → shows no. of obs, no. of variables, variable types and sample values [one of the most useful commands in R]
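A quick sketch of these commands on a small made-up data frame (d and its columns are hypothetical, not one of the course datasets):

```r
# Made-up data frame with 10 rows and 2 columns
d <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

head(d)    # first 6 rows (id 1 to 6)
tail(d)    # last 6 rows (id 5 to 10)
str(d)     # 10 obs. of 2 variables, with the type of each column
nrow(d)    # 10
names(d)   # "id" "score"
```

head(d, 3) and tail(d, 3) would show 3 rows instead of the default 6.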
2.2.2 - Descriptive Statistics
The easiest way to get quick statistics is:
summary(dataset)
summary(MMM)
This produces summaries for every variable.
Access Specific Column
To analyse a single variable you use:
dataset$column
MMM$Age
This extracts the Age vector.
Output:
[1] 9 76 45 44 11 24 26 104 51 19 37 52 27
Once you extract the vector, you can apply any function. E.g.
length(MMM$Age) → no. of obs
sd() → standard deviation
min(), max()
sum() → total of all values
var() → variance
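As a sketch, here are these functions applied to the Age values printed above, typed in by hand so the example runs without the file (the name ages is my own):

```r
# The 13 ages from the MMM$Age output above
ages <- c(9, 76, 45, 44, 11, 24, 26, 104, 51, 19, 37, 52, 27)

length(ages)   # 13
min(ages)      # 9
max(ages)      # 104
sum(ages)      # 525
mean(ages)     # sum divided by length
sd(ages)       # standard deviation
var(ages)      # variance, equal to sd(ages)^2
```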
Grouped Calculations
Sometimes we want statistics for groups. E.g. Mean age by sex.
by(data , grouping_variable , function)
by(MMM$Age , MMM$Sex , mean)
This means: group the ages by sex, and calculate the mean for each group.
Output:
MMM$Sex: Female
[1] 24
------------------------------------------
MMM$Sex: M
[1] 28
We can also use tapply() instead of by().
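A minimal sketch of tapply(), using made-up ages and groups rather than the MMM data:

```r
# Made-up values: four ages and the group each one belongs to
age <- c(20, 30, 40, 50)
sex <- c("F", "M", "F", "M")

# Same argument order as by(): values, grouping variable, function
group_means <- tapply(age, sex, mean)
group_means
#  F  M
# 30 40
```

The result is a named vector, so group_means["F"] picks out a single group.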
2.2.3 - Frequency Tables
Now we analyse categorical data. We will use a new dataset.
math_tests <- read.csv(file="http://ssa.cf.ac.uk/MAT514/Data/math_tests.csv",head=TRUE)
names(math_tests)
Output:
[1] "Name" "Teacher" "Pass.Fail"
To create a table → table(). E.g.
table(math_tests$Teacher , math_tests$Pass.Fail)
Output:
         F P
Mr Evans 2 4
Mr Smith 1 4
Meaning Mr Evans has 2 fails and 4 passes.
Saving Table
mytable <- table(math_tests$Teacher , math_tests$Pass.Fail)
[Name ← Code to create table] → Now the table becomes an object
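To try this without the file, here is a sketch that builds the same counts by hand. The individual rows are not shown in these notes, so the teacher and result vectors are made up to match the counts in the output above.

```r
# Made-up raw data that matches the counts above
teacher <- rep(c("Mr Evans", "Mr Smith"), times = c(6, 5))
result  <- c("F", "F", "P", "P", "P", "P",   # Mr Evans: 2 fails, 4 passes
             "F", "P", "P", "P", "P")        # Mr Smith: 1 fail, 4 passes

# Save the table as an object
mytable <- table(teacher, result)
mytable

# Because it is an object, we can pull values out of it
mytable["Mr Evans", "P"]   # 4
sum(mytable)               # 11 students in total
```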
Counts and Proportions
Total row count:
margin.table(mytable , 1)
Output:
Mr Evans Mr Smith
       6        5
Total column count:
margin.table(mytable , 2)
Output:
F P
3 8
Row proportion:
prop.table(mytable , 1)
Output:
                 F         P
Mr Evans 0.3333333 0.6666667
Mr Smith 0.2000000 0.8000000
Column proportion:
prop.table(mytable , 2)
Output:
                 F         P
Mr Evans 0.6666667 0.5000000
Mr Smith 0.3333333 0.5000000
2.2.4 - Correlations
Correlation measures the strength of the linear relationship between two variables. It only works for numeric variables.
Selecting numeric columns
To only keep numeric variables:
MMM[, sapply(MMM , is.numeric)]
To save:
MMMnum <- MMM[, sapply(MMM , is.numeric)]
Correlation Matrix
cor(MMMnum)
Output:
correlation between every pair of numeric variables
Testing Correlation Significance
To perform Pearson correlation test:
cor.test(MMM$Age , MMM$Height.in.Metres)
Output:
Pearson's product-moment correlation
data: MMM$Age and MMM$Height.in.Metres
t = -1.1598, df = 11, p-value = 0.2707
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7454925 0.2699957
sample estimates:
cor
-0.3300959
Output includes:
correlation value
p-value
confidence interval
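These three pieces can also be pulled out of the result one at a time. A sketch on made-up numbers (x, y and result are hypothetical names, not course data):

```r
# Made-up numeric vectors
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# Save the test result as an object
result <- cor.test(x, y)

result$estimate   # correlation value
result$p.value    # p-value
result$conf.int   # 95 percent confidence interval (lower, upper)
```

Saving the result is useful when you need the numbers later, e.g. to put them in a table.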
Using Packages
Some functions come from external packages. To install and load:
install.packages("Hmisc")
library(Hmisc)
rcorr()
This function calculates correlations and p-values, but it requires matrix input.
So, we convert into a matrix and then run it:
MMMmat <- as.matrix(MMMnum)
rcorr(MMMmat)
2.2.5 - Linear Models
For this, we will open a new dataset:
JJJ <- read.csv(file="http://ssa.cf.ac.uk/MAT514/Data/JJJ.csv",head=TRUE)
names(JJJ)
Output:
[1] "Name" "Age"
[3] "Sex" "Height.in.Metres"
[5] "Weight.in.Kg" "Home.Postcode"
[7] "Savings.in.Pounds" "Random.Number"
R can fit regression models.
General syntax: lm(outcome ~ predictors). E.g.
lm(Height.in.Metres ~ Weight.in.Kg + Savings.in.Pounds , data=JJJ)
Meaning: height is predicted by weight and savings.
We can also get the full regression output with summary(lm(…)). This shows:
coefficients
p-values
R²
residual error
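A sketch of how to reach each of these pieces, using made-up heights and weights instead of the JJJ data (body, fit and s are hypothetical names):

```r
# Made-up data: weight in kg and height in metres for 5 people
body <- data.frame(weight = c(50, 60, 70, 80, 90),
                   height = c(1.55, 1.62, 1.70, 1.76, 1.85))

# Fit the model and save the summary
fit <- lm(height ~ weight, data = body)
s   <- summary(fit)

s$coefficients   # estimates, standard errors, t values, p-values
s$r.squared      # R squared
s$sigma          # residual standard error
```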
We can also do ANOVA - used to test differences between groups.
If we wanted to know if grade depends on professor:
aov(GRADE ~ PROF , data=math)
To see full output:
summary(aov(..))
2.2.6 - Plots
R can also make graphs.
Histogram → shows distribution of heights:
hist(JJJ$Height.in.Metres)
Scatter plot → shows relationship between variables:
plot(JJJ$Weight.in.Kg , JJJ$Height.in.Metres)
2.3 - Exporting Graphs
png("plot.png")
plot(...)
dev.off()
E.g:
png("height_vs_weight_plot.png")
plot(JJJ$Weight.in.Kg , JJJ$Height.in.Metres)
dev.off()
This will save the graph as a PNG file.
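A self-contained sketch of the same three steps, saving into R's temporary folder so nothing is left behind. The file name and the plotted numbers are made up.

```r
# Build a file path inside the temporary folder
out_file <- file.path(tempdir(), "example_plot.png")

png(out_file)                    # 1. open the PNG file
plot(c(1, 2, 3), c(2, 4, 6))     # 2. draw the plot into it
dev.off()                        # 3. close the file (nothing is saved until this runs)

file.exists(out_file)            # should be TRUE once the file is written
```

A common mistake is forgetting dev.off(), in which case the file stays empty or incomplete.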
Chapter 3 - Manipulating Data
3.1 - Vectors and Data Frames
Remember → data frame = collection of vectors
Example dataset:
Name | Age | Height |
John | 20 | 1.8 |
Mary | 21 | 1.7 |
Each column is a vector.
3.1.1 - Vectors
Important idea → Recycling
When vectors have different lengths, R repeats the shorter one.
v <- c(1,2,3,4,5)
u <- c(0,1)
u+v
As v and u are different lengths, R will recycle u. So, the calculation is carried out as if:
v <- c(1,2,3,4,5)
u <- c(0,1,0,1,0)
Output:
1 3 3 5 5
R also gives a warning here, because the length of v (5) is not a multiple of the length of u (2).
We can also do:
v <- c(1,2,3,4,5)
u <- c(0,1,0,1,0)
v < 4
Output:
TRUE TRUE TRUE FALSE FALSE
TRUE/FALSE values are important because they allow for filtering.
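A small sketch of recycling and filtering together. Here the lengths are 4 and 2, so the shorter vector fits exactly twice and R gives no warning. The values are made up.

```r
v <- c(1, 2, 3, 4)
u <- c(0, 1)

u + v           # 1 3 3 5   (u is used as 0 1 0 1)
v * 10          # 10 20 30 40   (the single value 10 is recycled too)

# A comparison gives a logical vector...
keep <- v < 3   # TRUE TRUE FALSE FALSE

# ...which can be used to filter
v[keep]         # 1 2
```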
Indexing
Indexing means selecting elements by position. E.g:
dwarfs <- c("Dopey","Sneezey","Happy","Sleepy","Grumpy","Bashful","Doc")
Therefore, the vector positions are respectively Dopey=1, Sneezey=2,…,Doc=7.
Selecting Elements
dwarfs[c(1,4,5)]
Output:
Dopey Sleepy Grumpy
This means: give elements 1, 4, and 5.
Using Ranges
dwarfs[3:length(dwarfs)]
Output:
"Happy" "Sleepy" "Grumpy" "Bashful" "Doc"
This means: give from element 3 to element length(dwarfs). length(dwarfs) is the length of the entire vector, which is 7, so it gives element 3 to element 7.
Boolean Indexing
Index <- c(TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE)
dwarfs[Index]
Output:
Dopey Doc
Meaning keep everything that is TRUE, and skip everything that is FALSE.
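A few more things you can do with a logical index, as a sketch using the same dwarfs vector:

```r
dwarfs <- c("Dopey","Sneezey","Happy","Sleepy","Grumpy","Bashful","Doc")
Index  <- c(TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE)

dwarfs[Index]    # "Dopey" "Doc"
dwarfs[!Index]   # ! flips TRUE and FALSE, so this gives the other five
sum(Index)       # 2, because TRUE counts as 1
which(Index)     # 1 7, the positions that are TRUE
```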
Combining Filtering and Functions
Index <- substr(dwarfs,1,1) == "D"
dwarfs[Index]
Output:
Dopey Doc
Let's understand this:
substr(dwarfs,1,1)
This means "take the substring from position 1 to position 1", which extracts the first letter of each name. This would give:
D S H S G B D
Then we compare with:
== "D"
Output:
TRUE FALSE FALSE FALSE FALSE FALSE TRUE
So only elements 1 and 7 are TRUE, which means we keep those ones, giving us:
Dopey Doc
Applying to Data Frames
Everything we just did with vectors works with data frames.
Selecting Columns
MMM[c(1,2,4,6)]
This returns the columns at those positions.
Dropping Columns
MMM[c(-4,-6)]
You can also select by column name, e.g. MMM[c("Name" , "Age")] → This is safer in real analysis.
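A sketch comparing the three ways of picking columns, on a small made-up data frame (df and its columns are hypothetical):

```r
# Made-up data frame with 4 columns
df <- data.frame(Name   = c("Ann", "Ben"),
                 Age    = c(30, 41),
                 Height = c(1.6, 1.8),
                 City   = c("Cardiff", "Leeds"))

by_position <- df[c(1, 3)]              # keep columns 1 and 3
by_dropping <- df[c(-2, -4)]            # drop columns 2 and 4
by_name     <- df[c("Name", "Height")]  # keep columns by name

names(by_position)   # "Name" "Height"
```

All three give the same result here, but only the by-name version keeps working if the column order changes.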
Selecting columns using %in%
Index <- names(MMM) %in% c("Weight.in.Kg","Height.in.Metres")
MMM[Index]
Output:
Height.in.Metres
Weight.in.Kg
With this, R returns only the columns named inside c(). You can do the inverse [i.e. R returns every column except those named] by doing:
MMM[!Index]
3.1.2 - Selecting Observations [Rows]
To select a specific cell:
dataframe[i,j]
Meaning row i, column j. For example:
JJJ[7,2]
Output:
3
Selecting Entire Rows:
JJJ[7,]
Sorting Data
JJJ[order(JJJ$Age),]
This returns the rows of JJJ sorted by Age [ascending]. The result is not saved unless you assign it to a name.
Filtering Rows
JJJ[JJJ$Age <= 18 , ]
This selects the rows where the age is less than or equal to 18. Since the column part is blank, it returns all columns.
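The row operations above, sketched on a small made-up data frame (people, sorted and young are hypothetical names):

```r
# Made-up data frame
people <- data.frame(Name = c("Ann", "Ben", "Cat", "Dan"),
                     Age  = c(25, 12, 18, 40))

people[2, 2]       # 12 (row 2, column 2)
people[2, ]        # the whole of row 2

# Sort by Age and save the result
sorted <- people[order(people$Age), ]
sorted$Name        # "Ben" "Cat" "Ann" "Dan"

# Keep only rows where Age is 18 or less
young <- people[people$Age <= 18, ]
young$Name         # "Ben" "Cat"
```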
3.2 - Concatenating Datasets
Concatenate = stack datasets
MMMJJJ <- rbind(JJJ,MMM)
So the new dataset is the JJJ rows followed by the MMM rows. However, an important rule is that both datasets must have the same column names.
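A minimal sketch of stacking, using two made-up data frames with the same columns (a, b and both are hypothetical names):

```r
# Two made-up data frames with identical column names
a <- data.frame(Name = c("Ann", "Ben"), Age = c(30, 41))
b <- data.frame(Name = "Cat", Age = 25)

# Stack b underneath a
both <- rbind(a, b)

nrow(both)   # 3 (2 rows from a + 1 row from b)
both$Name    # "Ann" "Ben" "Cat"
```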
3.2.1 - Merging Datasets
Merge = combine datasets by key variable
merge(first_data_set,other_data_set, "Name")
This joins the datasets, matching rows where Name is the same. Only names that appear in both datasets are kept. This is called an inner join.
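A sketch of an inner join on two made-up data frames (ages, heights and inner are hypothetical names). Ann is only in the first one and Dan is only in the second, so both are dropped.

```r
# Made-up data frames that share a Name column
ages    <- data.frame(Name = c("Ann", "Ben", "Cat"), Age = c(30, 41, 25))
heights <- data.frame(Name = c("Ben", "Cat", "Dan"), Height = c(1.8, 1.7, 1.9))

# Inner join: only names found in both
inner <- merge(ages, heights, by = "Name")

inner$Name     # "Ben" "Cat"
nrow(inner)    # 2
names(inner)   # "Name" "Age" "Height"
```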
Different Join Types
Left join → keep all rows from first dataset
merge(..., all.x = TRUE)
Right join → keep all rows from second dataset
merge(..., all.y = TRUE)
Full join → keep all rows from both datasets
merge(..., all = TRUE)
3.3 - Creating New Variables
The formula for BMI = weight / height²
To create a new column called BMI:
MMM_With_BMI <- MMM
MMM_With_BMI$BMI <- MMM$Weight.in.Kg/(MMM$Height.in.Metres)^2
Useful Functions
abs(x) absolute value
floor(x) round down to the nearest integer
log(x) natural log
log10(x) log base 10
round(x,2) round to 2 decimals
sqrt(x) square root
String Operations
substr(string,start,stop)
We can convert Male → M, and Female → F using:
MMM_With_BMI$Sex <- substr(MMM_With_BMI$Sex,1,1)
3.3.1 - Renaming Variables
library(reshape)
JJJ<-rename(JJJ,c(Sex="Gender"))
3.3.2 - Operations across Rows
Cumulative Sum
To do this, we will open a new dataset:
birthday <- read.csv(file="https://ssa.cf.ac.uk/MAT514/Data/birthday_money.csv",head=TRUE)
To find cumulative sum:
birthday$total <- cumsum(birthday$Amount)
birthday$total
Output:
[1] 100 250 370 370 870 875
Differences
diff(birthday$Amount)
Output:
[1] 50 -30 -120 500 -495
But this will return one fewer value so we add NA:
birthday$yearly_diff <- c(NA, diff(birthday$Amount))
birthday$yearly_diff
Output:
[1] NA 50 -30 -120 500 -495
3.4 - Handling Dates
To convert from string/factors to real dates:
birthday$Birthday <- as.Date(birthday$Birthday,"%d/%m/%Y")
Sorting by Date
Sorts dataset by birthday:
birthday[order(birthday$Birthday),]
Difference between dates
Returns number of days between birthdays:
diff(birthday$Birthday[order(birthday$Birthday)])
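The three date steps, sketched on made-up dates so they run without the file (the names d and sorted are hypothetical):

```r
# Made-up dates written as day/month/year text
d <- as.Date(c("25/12/2020", "01/01/2020", "15/06/2020"), format = "%d/%m/%Y")

# Sort the dates
sorted <- d[order(d)]
sorted         # "2020-01-01" "2020-06-15" "2020-12-25"

# Days between each date and the next one
diff(sorted)   # 166 and 193 days
```

Without the format argument R would not know that the day comes first, so it is worth always giving it.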