Lecture 6 - stats
Lecture 6 (Chapter 3 of MSMB)
High Quality Graphics in R
Goals for this Chapter
In this lecture we will:
Learn how to rapidly and flexibly explore datasets through visualization.
Create beautiful and intuitive plots for scientific presentations and publications.
Understand the logic behind the grammar of graphics concept.
Introduce ggplot2.
See how to plot data in one, two, or even three to five dimensions, and explore faceting.
Create "along-genome" plots for molecular biology data (or along other sequences, e.g., peptides).
R Packages We Will Use
Install the following library using the R command:
install.packages("ggplot2")
Base Plotting in R
Introduction to base plotting using the DNase dataset.
The DNase data frame includes:
176 rows and 3 columns of data.
Data obtained during the development of an ELISA assay for the recombinant protein DNase in rat serum.
Columns in DNase dataset:
Run: assay runconc: concentration of the proteindensity: measured optical density in the assay
The first seven rows of the dataset are presented as follows:
DNase[1:7, ]
## Run conc density
## 1 1 0.04882812 0.017
## 2 1 0.04882812 0.018
## 3 1 0.19531250 0.121
## 4 1 0.19531250 0.124
## 5 1 0.39062500 0.206
## 6 1 0.39062500 0.215
## 7 1 0.78125000 0.377
Base Plotting in R (continued)
Create a basic scatter plot using the
plotfunction.
plot(DNase$conc, DNase$density)
The x-axis represents
DNase$conc, and the y-axis representsDNase$density. The plot will display concentration versus density for the ELISA assay of DNase.Example Output:
0 2 4 6 8 12
DNase$conc
0.0 1.0 2.0
DNase$density
Base Plotting in R (continued)
Change the axis labels:
plot(DNase$conc, DNase$density, xlab="DNase concentration (ng/ml)", ylab="Optical Density")
Example Output:
0 2 4 6 8 12
DNase concentration (ng/ml)
0.0 1.0 2.0
Optical density
Base Plotting in R – Plot Symbols
Change the plot symbol using the
pchparameter:
plot(DNase$conc, DNase$density, pch=4, xlab="DNase concentration (ng/ml)", ylab="Optical density")
Example Output:
Base Plotting in R – Color
Change the symbol color using the
colparameter:
plot(DNase$conc, DNase$density, pch=4, col="blue", xlab="DNase concentration (ng/ml)", ylab="Optical density")
Example Output:
Base Plotting in R – Adding Fit Lines
Add a linear fit line using
abline():
plot(DNase$conc, DNase$density, pch=4, col="blue", xlab="DNase concentration (ng/ml)", ylab="Optical density")
abline(lm(DNase$density ~ DNase$conc))
Example Output:
Base Plotting in R – Line Type
Change the color and line type of the fit line:
plot(DNase$conc, DNase$density, pch=4, col="blue", xlab="DNase concentration (ng/ml)", ylab="Optical density")
abline(lm(DNase$density ~ DNase$conc), lty=2, col="red")
Example Output:
Boxplot
A box plot is a representation of the five-number summary including:
Maximum
Minimum
Median
First quartile (25th quantile)
Third quartile (75th quantile)
Note that outliers are excluded when R computes the above statistics.
Creating a Boxplot in R
boxplot(density ~ Run, data=DNase)
Boxplot (continued)
For a sample vector
x:
x = 0:100
boxplot(x)
Results:
Minimum: 0
Maximum: 100
Median: 50
1st quartile: 25
3rd quartile: 75
Creating Boxplot with Outliers
For a vector including an outlier:
boxplot(c(0:100, 200))
Histogram
To create a histogram of
DNase$density:
hist(DNase$density)
This generates a distribution of
DNase$densityvalues showing frequency.
Histogram (continued)
Define custom x-axis label and split the x-axis into 25 intervals:
hist(DNase$density, breaks=25, xlab="DNase concentration (ng/ml)")
Introduction to ggplot2
A package developed by Hadley Wickham that implements the grammar of graphics.
The package is popular among statisticians and data practitioners.
Notable features include that its syntax differs significantly from base R plot commands.
Online tutorials available for learning ggplot2.
Grammar of Graphics
Introduced by Wilkinson (2005), the grammar of graphics describes the deep features underlying all statistical graphics.
Key components of any plot:
Data: The data to be visualized and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes such as color, size, or shape.
Layers: Composed of geometric elements (geoms), representing what you see on the plot (points, lines, polygons, etc.) and statistical transformations (stats) that summarize the data.
Scales: Map values in the data space to values in an aesthetic space.
Coordinate Systems: Define how data coordinates are mapped to the graphic plane.
Faceting: Specification to separate data into subsets and display them as small multiples.
Theme: Controls finer aspects of the display, such as font size and background color.
ggplot for the DNase Dataset
Using ggplot to create a scatter plot for the DNase dataset (concentration vs. density).
library("ggplot2")
ggplot(DNase, aes(x = conc, y = density)) + geom_point()
ggplot for the DNase Dataset – Scatter Plot
Understanding the ggplot grammar:
ggplot: specifies the dataset and the aesthetic mapping (i.e.aes).geom_point: specifies the layer to be added to the plot;geom_pointcan be replaced by other geometries like boxplot, histogram, or bar.
ggplot for the DNase Dataset – Boxplot
Creating a boxplot:
ggplot(DNase, aes(x = Run, y = density)) + geom_boxplot()
ggplot for the DNase Dataset – Histogram
Creating a histogram:
ggplot(DNase, aes(density)) + geom_histogram()
Faceting
To examine the relationship between concentration and density for each assay run separately, faceting divides the dataset into subsets according to
Run, creating a plot for each subset.
ggplot(DNase, aes(x = conc, y = density)) + geom_point() + facet_wrap(Run ~ ., ncol=3)
Faceting (continued)
Change the points to blue open circles:
ggplot(DNase, aes(x = conc, y = density)) + geom_point(shape=1, colour="blue") + facet_wrap(Run ~ .)
Arguments for ggplot vs plot
Key differences in argument names:
colourin ggplot vscolin base plot.shapein ggplot vspchin base plot.
Although
pchandcolcan also be used in ggplot,shapeandcolourcannot be used in base plot.
ggplot for the chickwts Dataset
The
chickwtsdataset is an in-built R dataset from an experiment measuring the effectiveness of various feed supplements on the growth rates of chickens.Viewing the first six records of
chickwts:
head(chickwts)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
Levels of the
feedvariable:
levels(chickwts$feed)
## [1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
ggplot for the chickwts Dataset – Barplot
Creating a barplot for the feed variable:
ggplot(chickwts, aes(feed)) + geom_bar()
ggplot for the chickwts Dataset – Boxplot
Creating a boxplot illustrating weight by feed type:
ggplot(chickwts, aes(x = feed, y = weight)) + geom_boxplot()
ggplot for the chickwts Dataset – Boxplot with Fill Color
Adding colors to boxplots:
ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) + geom_boxplot()
ggplot for the chickwts Dataset – Dotplot
Creating a dotplot is advisable when the number of data points is small:
ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) + geom_dotplot(binaxis='y', stackdir="center", dotsize=0.8)
Note: Overlapping points can obscure data and may result in a cluttered visualization.
ggplot for the chickwts Dataset – Violin Plot
Generating a violin plot to visualize weight by feed type:
ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) + geom_violin()
ggplot for the chickwts Dataset – Density Plot
Creating a density plot:
ggplot(chickwts, aes(x = weight, color = feed)) + geom_density()
Finding the Optimal Dose Level
Considering a new drug for lowering blood pressure, different doses are applied to patients with hypertension, recording the reduction in blood pressure:
head(highbp)
## dose bp.drop
## 1 0.05 -1.32581524
## 2 0.10 3.05457330
## 3 0.15 0.07748555
## 4 0.20 10.86112321
## 5 0.25 6.81803109
## 6 0.30 3.19812646
Finding the Optimal Dose Level – Scatter Plot
First, create a scatter plot:
ggplot(highbp, aes(x=dose, y=bp.drop)) + geom_point(shape=1)
Finding the Optimal Dose Level – Geom Smooth
To visualize the trend in data, using
geom_smooth:
ggplot(highbp, aes(x=dose, y=bp.drop)) + geom_smooth()
The blue line indicates the mean and gray band indicates confidence intervals.
Finding the Optimal Dose Level – Combined Plot
Combine the scatter plot and smooth line in a single plot:
ggplot(highbp, aes(x=dose, y=bp.drop)) + geom_point(shape=1) + geom_smooth()
Stepwise ggplot Approach
Create a stepwise plot where layers are added incrementally:
our.plot = ggplot(highbp, aes(x=dose, y=bp.drop))
our.plot = our.plot + geom_point(shape=1)
our.plot = our.plot + geom_smooth()
your.plot
Data Transformation
If variables
xandyare already defined, a scatterplot can be generated:
ggplot(data=NULL, aes(x=x, y=y)) + geom_point()
Data Transformation - Transparency
To improve visibility when dealing with numerous data points; make points transparent using
alpha:
ggplot(NULL, aes(x=x, y=y)) + geom_point(alpha=0.2)
Data Transformation - Square Root
Instead of plotting
yversusx, plotsqrt(y)versusx:
ggplot(NULL, aes(x=x, y=sqrt(y))) + geom_point(alpha=0.2)
The variance of
yappears stabilized after transformation.
Genomic Data Visualization
A display of the ideogram of human chromosome 1:
library("ggbio")
data("hg19IdeogramCyto", package = "biovizBase")
plotIdeogram(hg19IdeogramCyto, subchr="chr1")
Genomic Data: Editing Sites
Visualization of 500 RNA editing sites in the human genome:
library("GenomicRanges")
data("darned_hg19_subset500", package = "biovizBase")
autoplot(darned_hg19_subset500, layout="karyogram", aes(color=exReg, fill=exReg))
This renders a karyogram layout displaying the editing sites mapped against the human chromosome data linked to
exRegvalues.