R and Shiny - Own Notes

1.2 - Objects

R is an object-oriented language, meaning everything in R is an object, e.g:

  • numbers

  • datasets

  • regression results

  • plots

  • vectors

What is a Class?

Every object has a class [type].

  • numbers → numeric

  • text → character

  • TRUE/FALSE → logical

  • dataset → data frame

What is a Method?

A method is something you do to an object. E.g.

plot()
plot(data)
plot(regression)

This same command will work on many objects.
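
A minimal sketch of classes and methods, assuming three made-up objects (x, df and fit are hypothetical names, not course data). class() reports the class of an object, and a generic function such as summary() chooses what to do based on that class:

```r
# Hypothetical example objects
x   <- c(2, 3, 5, 2, 4)                       # numeric vector
df  <- data.frame(a = 1:3, b = c(2.5, 4, 7))  # data frame
fit <- lm(b ~ a, data = df)                   # regression result

class(x)    # "numeric"
class(df)   # "data.frame"
class(fit)  # "lm"

# The same generic function gives a different kind of summary per class
sx   <- summary(x)    # five-number summary plus mean
sfit <- summary(fit)  # regression summary
sx
sfit
```

The same idea is why plot() can be called on a dataset or on a regression result.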

The 2 data structures we will mainly use are vectors and data frames.

Vectors

A vector is simply a sequence of values of the same type, created with c(). A numeric vector would be c(2, 3, 5, 2, 4). A character vector would be c("dog", "cat", "bird", "snake"). A logical vector would be c(TRUE, FALSE, FALSE, TRUE).

An important rule in R → All elements of a vector must have the same type. If you mix types, R converts every element to one common type.
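
A short sketch of the three vector types and of what happens when types are mixed. The variable names (nums, words, flags, mixed) are made up for illustration:

```r
nums  <- c(2, 3, 5, 2, 4)
words <- c("dog", "cat", "bird", "snake")
flags <- c(TRUE, FALSE, FALSE, TRUE)

class(nums)   # "numeric"
class(words)  # "character"
class(flags)  # "logical"

# Mixing types does not give an error: R converts everything
# to one common type (here, character)
mixed <- c(1, "dog", TRUE)
class(mixed)  # "character"
mixed         # "1" "dog" "TRUE"
```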

Data Frames

A data frame is a table. It is made of vectors combined together. E.g.

Name   Age   Smoker
Bob    24    TRUE
Jon    39    FALSE

Each column is a vector.

  • Name = vector

  • Age = vector

  • Smoker = vector

Combine them → data frame
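
As a sketch, the table above can be built by combining three vectors with data.frame(). The object name people is an assumption made for this example:

```r
Name   <- c("Bob", "Jon")
Age    <- c(24, 39)
Smoker <- c(TRUE, FALSE)

# Combine the three vectors into one data frame
people <- data.frame(Name, Age, Smoker)
people

people$Age    # 24 39  (a column is still a vector)
dim(people)   # 2 3    (2 rows, 3 columns)
```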

1.3 - Importing Data

Normally, you import data from files. E.g.

JJJ <- read.csv(file="JJJ.csv", header=TRUE)

To see the dataset:

JJJ

1.4 - Exporting Data

To save an R dataset as a file:

write.csv(first_data_set,"~/Desktop/first_data_set.csv")
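
A small round-trip sketch: write a data frame to a CSV file and read it back. The contents of first_data_set are made up here, and a temporary file is used instead of the Desktop path. row.names = FALSE is an optional extra that stops R from writing the row numbers as an additional column:

```r
# Hypothetical data standing in for first_data_set
first_data_set <- data.frame(Name = c("Bob", "Jon"), Age = c(24, 39))

out_file <- tempfile(fileext = ".csv")   # temporary location
write.csv(first_data_set, out_file, row.names = FALSE)

# Read the file back in to check the export
back <- read.csv(file = out_file, header = TRUE)
back
```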

Chapter 2

2.1 - Procedures

Most operations in R are functions applied to vectors.

From Ch.1:

  • vectors = columns

  • data frames = tables made of vectors

So, when you run a statistical procedure, R is usually doing something like function(vector) or function(column_of_dataset). E.g.

mean(MMM$Age)

→ means calculate the mean of the Age column

2.2 - Utility Procedures

These are basic commands used to inspect a dataset. Before doing analysis, you often want to know:

  • what variables exist

  • what the dataset looks like

  • how many rows it has

  • what type each variable is

First, the dataset is imported:

MMM <- read.csv(file="http://ssa.cf.ac.uk/MAT514/Data/MMM.csv",header=TRUE)

The dataset has 13 observations of 8 variables.

To see column names of a dataset:

names(MMM)

Output:

[1] "Name"              "Age"              
[3] "Sex"               "Height.in.Metres" 
[5] "Weight.in.Kg"      "Home.Postcode"    
[7] "Savings.in.Pounds" "Random.Number"

This answers only one question: what variables exist in the dataset?

We can also do:

  • head() → prints first 6 rows

  • tail() → prints last 6 rows

  • str() → shows no. of obs, no. of variables, variable types and sample values [one of the most useful commands in R]
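
A quick sketch of these commands on a made-up 10-row data frame (toy is a hypothetical name), so it runs without downloading anything. nrow() and ncol() are extra base R functions that give the size directly:

```r
toy <- data.frame(id = 1:10, score = seq(50, 95, by = 5))

head(toy)      # rows 1 to 6
tail(toy)      # rows 5 to 10
head(toy, 3)   # first 3 rows only
str(toy)       # size, variable types and sample values
nrow(toy)      # 10
ncol(toy)      # 2
```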

2.2.2 - Descriptive Statistics

The easiest way to get quick statistics is:

summary(dataset)
summary(MMM)

This produces summaries for every variable.

Access Specific Column

To analyse a single variable you use:

dataset$column
MMM$Age 

This extracts the Age vector.

Output:

[1]   9  76  45  44  11  24  26 104  51  19  37  52  27

Once you extract the vector, you can apply any function. E.g.

  • length(MMM$Age) → no. of obs

  • sd() → standard deviation

  • min(), max()

  • sum() → total of all values

  • var() → variance
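
The functions above can be tried on the Age values printed earlier. Here they are typed in by hand as a vector called age (a name chosen for this sketch), so the code runs without the dataset:

```r
age <- c(9, 76, 45, 44, 11, 24, 26, 104, 51, 19, 37, 52, 27)

length(age)  # 13
sum(age)     # 525
min(age)     # 9
max(age)     # 104
mean(age)
sd(age)
var(age)

# The standard deviation is the square root of the variance
all.equal(sd(age), sqrt(var(age)))  # TRUE
```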

Grouped Calculations

Sometimes we want statistics for groups. E.g. Mean age by sex.

by(data , grouping_variable , function)
by(MMM$Age , MMM$Sex , mean)

This means: group the ages by sex, and calculate the mean for each group.

Output:

MMM$Sex: Female
[1] 24
------------------------------------------ 
MMM$Sex: M
[1] 28

We can also use tapply() instead of by()
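
A sketch comparing by() and tapply() on made-up data (the people data frame and its values are assumptions for illustration). tapply() takes the same three arguments but returns a plain named vector, which is easier to reuse:

```r
people <- data.frame(
  Age = c(20, 30, 40, 50),
  Sex = c("Female", "Male", "Female", "Male")
)

by(people$Age, people$Sex, mean)            # printed group by group

means <- tapply(people$Age, people$Sex, mean)
means
means["Female"]  # 30
means["Male"]    # 40
```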

2.2.3 - Frequency Tables

Now we analyse categorical data. We will use a new dataset.

math_tests <- read.csv(file="http://ssa.cf.ac.uk/MAT514/Data/math_tests.csv",header=TRUE)
names(math_tests)

Output:

[1] "Name"      "Teacher"   "Pass.Fail"

To create a table → table(). E.g.

table(math_tests$Teacher , math_tests$Pass.Fail)

Output:

           F P
  Mr Evans 2 4
  Mr Smith 1 4

Meaning Mr Evans has 2 fails and 4 passes.

Saving Table

mytable <- table(math_tests$Teacher , math_tests$Pass.Fail)

[name <- code to create table] → Now the table is stored as an object called mytable.

Counts and Proportions

Row totals:

margin.table(mytable , 1)

Output:

Mr Evans Mr Smith 
       6        5 

Column totals:

margin.table(mytable , 2)

Output:

F P 
3 8 

Row Proportion:

prop.table(mytable , 1)

Output:
                   F         P
  Mr Evans 0.3333333 0.6666667
  Mr Smith 0.2000000 0.8000000

Column Proportion

prop.table(mytable , 2)

Output:
                   F         P
  Mr Evans 0.6666667 0.5000000
  Mr Smith 0.3333333 0.5000000
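
The whole frequency-table workflow can be sketched offline. The data frame toy_tests below is a hypothetical stand-in built only to reproduce the counts shown above (Mr Evans: 2 F, 4 P; Mr Smith: 1 F, 4 P). It is not the real file:

```r
toy_tests <- data.frame(
  Teacher   = c(rep("Mr Evans", 6), rep("Mr Smith", 5)),
  Pass.Fail = c("F", "F", "P", "P", "P", "P",
                "F", "P", "P", "P", "P")
)

mytable <- table(toy_tests$Teacher, toy_tests$Pass.Fail)
mytable

margin.table(mytable, 1)   # row totals: 6 and 5
margin.table(mytable, 2)   # column totals: 3 and 8

row_props <- prop.table(mytable, 1)
col_props <- prop.table(mytable, 2)
rowSums(row_props)         # each row of row proportions sums to 1
colSums(col_props)         # each column of column proportions sums to 1
```

The second argument is the margin: 1 means work across rows, 2 means work down columns.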

2.2.4 - Correlations

Correlation measures the strength of the linear relationship between two variables. It only works for numeric variables.

Selecting numeric columns

To only keep numeric variables:

MMM[, sapply(MMM , is.numeric)]

To save:

MMMnum <- MMM[, sapply(MMM , is.numeric)]

Correlation Matrix

cor(MMMnum)

Output:
correlation between every pair of numeric variables 
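
A sketch of both steps on a made-up data frame (toy and its values are assumptions). sapply(toy, is.numeric) returns one TRUE/FALSE per column, and that logical vector is then used to pick the columns:

```r
toy <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Age    = c(20, 30, 40, 50),
  Height = c(1.60, 1.70, 1.80, 1.90)
)

keep <- sapply(toy, is.numeric)
keep                      # Name FALSE, Age TRUE, Height TRUE

toynum <- toy[, keep]     # numeric columns only
names(toynum)             # "Age" "Height"

# Height rises in a perfectly straight line with Age in this toy data,
# so the correlation between them is 1
cor(toynum)
```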

Testing Correlation Significance

To perform a Pearson correlation test:

cor.test(MMM$Age , MMM$Height.in.Metres)

Output:
Pearson's product-moment correlation

data:  MMM$Age and MMM$Height.in.Metres
t = -1.1598, df = 11, p-value = 0.2707
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7454925  0.2699957
sample estimates:
       cor 
-0.3300959 

Output includes:

  • correlation value

  • p-value

  • confidence interval

Using Packages

Some functions come from external packages. To install and load:

install.packages("Hmisc")
library(Hmisc)

rcorr()

This function calculates correlations and p-values, but it requires matrix input.

So, we convert into a matrix and then run it:

MMMmat <- as.matrix(MMMnum)
rcorr(MMMmat)

2.2.5 - Linear Models

For this, we will open a new dataset:

JJJ <- read.csv(file="http://ssa.cf.ac.uk/MAT514/Data/JJJ.csv",header=TRUE)
names(JJJ)

Output:
[1] "Name"              "Age"              
[3] "Sex"               "Height.in.Metres" 
[5] "Weight.in.Kg"      "Home.Postcode"    
[7] "Savings.in.Pounds" "Random.Number" 

R can fit regression models.

General syntax: lm(outcome ~ predictors). E.g.

lm(Height.in.Metres ~ Weight.in.Kg + Savings.in.Pounds , data=JJJ)

Meaning height predicted by weight and savings

We can also get the full regression output with summary(lm(…)). This shows:

  • coefficients

  • p-values

  • R²

  • residual error
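
A minimal regression sketch on made-up numbers (toy, x and y are hypothetical). Saving the model and its summary as objects lets you pull out the individual pieces listed above:

```r
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = toy)   # y predicted by x
res <- summary(fit)
res                            # full regression output

coef(fit)            # intercept and slope
res$coefficients     # estimates, standard errors, t values, p-values
res$r.squared        # R squared
res$sigma            # residual standard error
```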

We can also do ANOVA - used to test differences between groups.

If we wanted to know if grade depends on professor:

aov(GRADE ~ PROF , data=math)

To see full output:

summary(aov(...))
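
A small ANOVA sketch. The data frame math below is invented for illustration (two professors, three grades each). Only the column names GRADE and PROF follow the call above:

```r
# Hypothetical data: grades for two professors
math <- data.frame(
  GRADE = c(55, 60, 65, 70, 75, 80),
  PROF  = c("A", "A", "A", "B", "B", "B")
)

fit <- aov(GRADE ~ PROF, data = math)
summary(fit)              # ANOVA table with F value and p-value

tab <- summary(fit)[[1]]  # the table itself, as a data frame
tab$Df                    # 1 (between professors) and 4 (residuals)
```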

2.2.6 - Plots

R can also make graphs.

Histogram → shows distribution of heights:

hist(JJJ$Height.in.Metres)

Scatter plot → shows relationship between variables:

plot(JJJ$Weight.in.Kg , JJJ$Height.in.Metres)

2.3 - Exporting Graphs

png("plot.png")
plot(...)
dev.off()

E.g:

png("height_vs_weight_plot.png")
plot(JJJ$Weight.in.Kg , JJJ$Height.in.Metres)
dev.off()

This will save the graph as a PNG file.

Chapter 3 - Manipulating Data

3.1 - Vectors and Data Frames

Remember → data frame = collection of vectors

Example dataset:

Name   Age   Height
John   20    1.8
Mary   21    1.7

Each column is a vector.

3.1.1 - Vectors

Important idea → Recycling

When vectors have different lengths, R repeats the shorter one.

v <- c(1,2,3,4,5)
u <- c(0,1)

u+v

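As v and u have different lengths, R recycles u until it matches the length of v, so the sum is computed as if we had written:

v <- c(1,2,3,4,5)
u <- c(0,1,0,1,0)

Output:
1 3 3 5 5

R also prints a warning here, because the length of v (5) is not a multiple of the length of u (2).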
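
A sketch of recycling in three cases (w is a name made up for this example). suppressWarnings() is used only to keep the last case quiet when the code is run:

```r
v <- c(1, 2, 3, 4)
u <- c(0, 1)

v + u    # 1 3 3 5   (u is used as 0 1 0 1)
v * 2    # 2 4 6 8   (the single value 2 is recycled four times)

# Length 5 is not a multiple of length 2: R recycles in the same way
# but would normally print a warning
w <- suppressWarnings(c(1, 2, 3, 4, 5) + u)
w        # 1 3 3 5 5
```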

We can also compare every element of a vector with a value:

v <- c(1,2,3,4,5)

v < 4

Output:
TRUE TRUE TRUE FALSE FALSE

TRUE/FALSE values are important because they allow filtering.
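
A short sketch of filtering with a logical vector (small is a hypothetical name). Putting the TRUE/FALSE vector inside the square brackets keeps only the elements marked TRUE:

```r
v <- c(1, 2, 3, 4, 5)

small <- v < 4    # TRUE TRUE TRUE FALSE FALSE
v[small]          # 1 2 3
v[!small]         # 4 5   (! flips TRUE and FALSE)
sum(small)        # 3     (TRUE counts as 1, so this counts the matches)
```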

Indexing

Indexing means selecting elements by position. E.g:

dwarfs <- c("Dopey","Sneezey","Happy","Sleepy","Grumpy","Bashful","Doc")

Therefore, the vector positions are, respectively, Dopey=1, Sneezey=2,…,Doc=7.

Selecting Elements

dwarfs[c(1,4,5)]

Output:
Dopey Sleepy Grumpy

This means give element 1, 4, and 5.

Using Ranges

dwarfs[3:length(dwarfs)]

Output:
"Happy"   "Sleepy"  "Grumpy"  "Bashful" "Doc" 

This means give everything from element 3 to element length(dwarfs). length(dwarfs) is the length of the entire vector, which is 7, so it gives elements 3 to 7.

Boolean Indexing

Index <- c(TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE)
dwarfs[Index]

Output:
Dopey Doc

Meaning keep everything that is true, and skip everything that is false.

Combining Filtering and Functions

Index <- substr(dwarfs,1,1) == "D"
dwarfs[Index]

Output:
Dopey Doc

Let’s understand this:

substr(dwarfs,1,1)

This means “take the substring from position 1 to position 1”, i.e. extract the first letter of each name. Note that the third argument is the stop position, not a length. This would give:

D S H S G B D

Then we compare with:

== "D"

Output:
TRUE FALSE FALSE FALSE FALSE FALSE TRUE

So only elements 1 and 7 are TRUE, which means we keep those, giving us:

Dopey Doc 
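
The same steps as one runnable sketch. first_letter is a name made up here, and which() is an extra base R function that turns TRUE/FALSE into positions:

```r
dwarfs <- c("Dopey","Sneezey","Happy","Sleepy","Grumpy","Bashful","Doc")

first_letter <- substr(dwarfs, 1, 1)   # characters 1 to 1 of each name
Index <- first_letter == "D"

which(Index)       # 1 7
dwarfs[Index]      # "Dopey" "Doc"

substr(dwarfs, 1, 3)                   # characters 1 to 3 of each name
dwarfs[substr(dwarfs, 1, 1) == "S"]    # "Sneezey" "Sleepy"
```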

Applying to Data Frames

Everything we just did with vectors works with data frames.

Selecting Columns

MMM[c(1,2,4,6)]

This returns the columns at those positions.

Dropping Columns

MMM[c(-4,-6)]

You can also select by column name, e.g. MMM[c("Name","Age")] → This is safer in real analysis, because it still works if the column order changes.

Selecting columns using %in%

Index <- names(MMM) %in% c("Weight.in.Kg","Height.in.Metres")
MMM[Index]

Output:
Height.in.Metres
Weight.in.Kg

With this, R returns the columns whose names appear inside c(...). You can do the inverse of this [i.e. R returns every column except those named] by doing:

MMM[!Index]
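
A sketch of the column-selection options on a made-up data frame (toy is hypothetical, and it borrows four of the MMM column names so the calls look familiar):

```r
toy <- data.frame(
  Name = c("A", "B"),
  Age  = c(20, 30),
  Height.in.Metres = c(1.6, 1.8),
  Weight.in.Kg     = c(60, 80)
)

names(toy[c(1, 2)])            # by position: "Name" "Age"
names(toy[-1])                 # drop column 1
names(toy[c("Name", "Age")])   # by name

Index <- names(toy) %in% c("Weight.in.Kg", "Height.in.Metres")
Index                          # FALSE FALSE TRUE TRUE
names(toy[Index])              # "Height.in.Metres" "Weight.in.Kg"
names(toy[!Index])             # "Name" "Age"
```

Columns come back in the order they have in the data frame, not the order listed inside c(...).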

3.1.2 - Selecting Observations [Rows]

To select a specific cell:

dataframe[i,j]

Meaning row i, column j. For example:

JJJ[7,2]

Output:
3

Selecting Entire Rows:

JJJ[7,]

Sorting Data

JJJ[order(JJJ$Age),]

This prints the dataset [JJJ] sorted by Age in ascending order. JJJ itself is unchanged unless you assign the result back.

Filtering Rows

JJJ[JJJ$Age <= 18 , ]

This will allow you to select rows where the age is less than or equal to 18. Since the column part is blank, it will return all columns.
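
A sketch of cell selection, sorting and filtering on made-up data (toy, sorted and young are hypothetical names):

```r
toy <- data.frame(Name = c("A", "B", "C", "D"),
                  Age  = c(30, 12, 18, 45))

toy[2, 2]     # row 2, column 2: 12
toy[2, ]      # all of row 2

sorted <- toy[order(toy$Age), ]    # ascending by Age
sorted$Name                        # B, C, A, D

toy[order(-toy$Age), ]             # descending by Age

young <- toy[toy$Age <= 18, ]      # rows where Age is at most 18
young$Name                         # B, C
```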

3.2 - Concatenating Datasets

Concatenate = stack datasets

MMMJJJ <- rbind(JJJ,MMM)

So the new dataset is the JJJ rows followed by the MMM rows. However, an important rule is that both datasets must have the same column names.
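
A sketch of stacking two made-up data frames (a, b, ab and c_bad are hypothetical). The last part shows what happens when the column names differ. try() is used only so that the error does not stop the script:

```r
a <- data.frame(Name = c("A", "B"), Age = c(20, 30))
b <- data.frame(Name = "C", Age = 40)

ab <- rbind(a, b)   # rows of a, then rows of b
nrow(ab)            # 3
ab$Age              # 20 30 40

# Column names differ (Age vs Years), so rbind() fails
c_bad <- data.frame(Name = "D", Years = 50)
res <- try(rbind(a, c_bad), silent = TRUE)
inherits(res, "try-error")   # TRUE
```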

3.2.1 - Merging Datasets

Merge = combine datasets by key variable

merge(first_data_set,other_data_set, by="Name")

This will join the datasets matching rows where Name is the same. This is called an inner join.

Different Join Types

Left join → keep all rows from first dataset

merge(..., all.x = TRUE)

Right join → keep all rows from second dataset

merge(..., all.y = TRUE)

Full join → keep all rows from both datasets

merge(..., all = TRUE)
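
The four join types can be compared on two small made-up datasets. The contents of first_data_set and other_data_set below are assumptions for illustration. Bob and Cat appear in both, Ann only in the first, Dan only in the second:

```r
first_data_set <- data.frame(Name = c("Ann", "Bob", "Cat"),
                             Age  = c(20, 30, 40))
other_data_set <- data.frame(Name   = c("Bob", "Cat", "Dan"),
                             Height = c(1.8, 1.6, 1.7))

inner <- merge(first_data_set, other_data_set, by = "Name")
left  <- merge(first_data_set, other_data_set, by = "Name", all.x = TRUE)
right <- merge(first_data_set, other_data_set, by = "Name", all.y = TRUE)
full  <- merge(first_data_set, other_data_set, by = "Name", all = TRUE)

nrow(inner)   # 2  (Bob, Cat)
nrow(left)    # 3  (Ann kept, Height is NA)
nrow(right)   # 3  (Dan kept, Age is NA)
nrow(full)    # 4
left
```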

3.3 - Creating New Variables

The formula for BMI = weight / height²

To create a new column called BMI:

MMM_With_BMI <- MMM
MMM_With_BMI$BMI <- MMM$Weight.in.Kg/(MMM$Height.in.Metres)^2
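
A sketch of the same calculation on made-up values (toy is hypothetical), chosen so the result is easy to check by hand. The calculation is done element by element, one BMI per row:

```r
toy <- data.frame(Weight.in.Kg     = c(81, 50),
                  Height.in.Metres = c(1.8, 2.0))

# ^ is applied before /, so this is weight / (height squared)
toy$BMI <- toy$Weight.in.Kg / toy$Height.in.Metres^2

round(toy$BMI, 1)   # 25.0 12.5
toy
```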

Useful Functions

abs(x)      absolute value
floor(x)    round down to nearest integer
log(x)      natural log
log10(x)    log base 10
round(x,2)  round to 2 decimals
sqrt(x)     square root
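
A quick sketch of these functions with values that are easy to verify. trunc() is an extra base R function, included to show how it differs from floor() for negative numbers:

```r
abs(-3.7)           # 3.7
floor(3.7)          # 3
floor(-3.7)         # -4  (rounds down, away from zero here)
trunc(-3.7)         # -3  (drops the decimal part)
log(exp(1))         # 1
log10(1000)         # 3
round(3.14159, 2)   # 3.14
sqrt(16)            # 4

# They all work on whole vectors (and so on data frame columns)
sqrt(c(1, 4, 9))    # 1 2 3
```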

String Operations

substr(string,start,stop)

We can convert Male → M, and Female → F using:

MMM_With_BMI$Sex <- substr(MMM_With_BMI$Sex,1,1)

3.3.1 - Renaming Variables

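library(reshape)   # run install.packages("reshape") first if it is not installed
JJJ <- rename(JJJ, c(Sex="Gender"))   # pattern: c(old_name="new_name")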
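
If the reshape package is not available, one alternative is to rename the column with base R only. This is a different technique from rename(), sketched here on a made-up data frame (toy is hypothetical):

```r
toy <- data.frame(Name = c("A", "B"), Sex = c("Female", "Male"))

# Find the column called "Sex" and overwrite its name
names(toy)[names(toy) == "Sex"] <- "Gender"

names(toy)   # "Name" "Gender"
```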

3.3.2 - Operations across Rows

Cumulative Sum

To do this, we will open a new dataset:

birthday <- read.csv(file="https://ssa.cf.ac.uk/MAT514/Data/birthday_money.csv",header=TRUE)

To find cumulative sum:

birthday$total <- cumsum(birthday$Amount)
birthday$total

Output:
[1] 100 250 370 370 870 875

Differences

diff(birthday$Amount)

Output:
[1]   50  -30 -120  500 -495

But diff() returns one fewer value than the original column, so we add NA at the start:

birthday$yearly_diff <- c(NA, diff(birthday$Amount))
birthday$yearly_diff

Output:
[1]   NA   50  -30 -120  500 -495
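
The two operations can be checked offline. The Amount values below are inferred from the outputs printed above, not copied from the file, so treat them as an assumption:

```r
Amount <- c(100, 150, 120, 0, 500, 5)

cumsum(Amount)         # 100 250 370 370 870 875
diff(Amount)           # 50 -30 -120 500 -495
length(diff(Amount))   # 5, one fewer than length(Amount)

# Pad with NA so the result lines up with the 6 rows
yearly_diff <- c(NA, diff(Amount))
length(yearly_diff)    # 6
```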

3.4 - Handling Dates

To convert from string/factors to real dates:

birthday$Birthday <- as.Date(birthday$Birthday,"%d/%m/%Y")

Sorting by Date

Sorts dataset by birthday:

birthday[order(birthday$Birthday),]

Difference between dates

Returns number of days between birthdays:

diff(birthday$Birthday[order(birthday$Birthday)])
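
A sketch of the full date workflow on three made-up birthdays (the names and dates are assumptions for illustration). stringsAsFactors = FALSE is set so the column starts as plain text in any R version:

```r
birthday <- data.frame(
  Name     = c("A", "B", "C"),
  Birthday = c("25/12/1990", "01/01/1990", "15/06/1990"),
  stringsAsFactors = FALSE
)

# %d = day, %m = month, %Y = four-digit year
birthday$Birthday <- as.Date(birthday$Birthday, format = "%d/%m/%Y")
class(birthday$Birthday)    # "Date"

sorted <- birthday[order(birthday$Birthday), ]
sorted$Name                 # "B" "C" "A"

gaps <- diff(sorted$Birthday)
gaps                        # time differences in days
as.numeric(gaps)            # 165 193
```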