Lecture 6 - stats

Lecture 6 (Chapter 3 of MSMB)

High Quality Graphics in R

Goals for this Chapter

In this lecture we will:

  • Learn how to rapidly and flexibly explore datasets through visualization.

  • Create beautiful and intuitive plots for scientific presentations and publications.

  • Understand the logic behind the grammar of graphics concept.

  • Introduce ggplot2.

  • See how to plot data in one, two, or even three to five dimensions, and explore faceting.

  • Create "along-genome" plots for molecular biology data (or along other sequences, e.g., peptides).

R Packages We Will Use

  • Install the following library using the R command:

install.packages("ggplot2")  

Base Plotting in R

  • Introduction to base plotting using the DNase dataset.

  • The DNase data frame includes:

    • 176 rows and 3 columns of data.

    • Data obtained during the development of an ELISA assay for the recombinant protein DNase in rat serum.

    • Columns in DNase dataset:

    • Run: assay run

    • conc: concentration of the protein

    • density: measured optical density in the assay

  • The first seven rows of the dataset are presented as follows:

DNase[1:7, ]  
## Run  conc density  
## 1  1  0.04882812  0.017  
## 2  1  0.04882812  0.018  
## 3  1  0.19531250  0.121  
## 4  1  0.19531250  0.124  
## 5  1  0.39062500  0.206  
## 6  1  0.39062500  0.215  
## 7  1  0.78125000  0.377  

Base Plotting in R (continued)

  • Create a basic scatter plot using the plot function.

plot(DNase$conc, DNase$density)  
  • The x-axis represents DNase$conc, and the y-axis represents DNase$density. The plot will display concentration versus density for the ELISA assay of DNase.

  • Example Output:

0  2  4  6  8  12  
DNase$conc  
0.0  1.0  2.0  
DNase$density  


Figure 1: Plot of concentration vs. density for an ELISA assay of DNase.

Base Plotting in R (continued)

  • Change the axis labels:

plot(DNase$conc, DNase$density, xlab="DNase concentration (ng/ml)", ylab="Optical Density")  
  • Example Output:

0  2  4  6  8  12  
DNase concentration (ng/ml)  
0.0  1.0  2.0  
Optical density  


Figure 2: Plot of concentration vs. density for an ELISA assay of DNase.

Base Plotting in R – Plot Symbols

  • Change the plot symbol using the pch parameter:

plot(DNase$conc, DNase$density, pch=4, xlab="DNase concentration (ng/ml)", ylab="Optical density")  
  • Example Output:

    Figure 3: Plot of concentration vs. density for an ELISA assay of DNase.

Base Plotting in R – Color

  • Change the symbol color using the col parameter:

plot(DNase$conc, DNase$density, pch=4, col="blue", xlab="DNase concentration (ng/ml)", ylab="Optical density")  
  • Example Output:

    Figure 4: Plot of concentration vs. density for an ELISA assay of DNase.

Base Plotting in R – Adding Fit Lines

  • Add a linear fit line using abline():

plot(DNase$conc, DNase$density, pch=4, col="blue",  xlab="DNase concentration (ng/ml)", ylab="Optical density")  
abline(lm(DNase$density ~ DNase$conc))  
  • Example Output:

    Figure 5: Plot of concentration vs. density for an ELISA assay of DNase.

Base Plotting in R – Line Type

  • Change the color and line type of the fit line:

plot(DNase$conc, DNase$density, pch=4, col="blue", xlab="DNase concentration (ng/ml)", ylab="Optical density")  
abline(lm(DNase$density ~ DNase$conc), lty=2, col="red")  
  • Example Output:

    Figure 6: Plot of concentration vs. density for an ELISA assay of DNase.

Boxplot

  • A box plot is a representation of the five-number summary including:

    • Maximum

    • Minimum

    • Median

    • First quartile (25th quantile)

    • Third quartile (75th quantile)

  • Note that outliers are excluded when R computes the above statistics.

Creating a Boxplot in R

boxplot(density ~ Run, data=DNase)  

Boxplot (continued)

  • For a sample vector x:

x = 0:100  
boxplot(x)  
  • Results:

    • Minimum: 0

    • Maximum: 100

    • Median: 50

    • 1st quartile: 25

    • 3rd quartile: 75

Creating Boxplot with Outliers

  • For a vector including an outlier:

boxplot(c(0:100, 200))  

Histogram

  • To create a histogram of DNase$density:

hist(DNase$density)  
  • This generates a distribution of DNase$density values showing frequency.

Histogram (continued)

  • Define custom x-axis label and split the x-axis into 25 intervals:

hist(DNase$density, breaks=25, xlab="DNase concentration (ng/ml)")  

Introduction to ggplot2

  • A package developed by Hadley Wickham that implements the grammar of graphics.

  • The package is popular among statisticians and data practitioners.

  • Notable features include that its syntax differs significantly from base R plot commands.

  • Online tutorials available for learning ggplot2.

Grammar of Graphics

  • Introduced by Wilkinson (2005), the grammar of graphics describes the deep features underlying all statistical graphics.

  • Key components of any plot:

    • Data: The data to be visualized and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes such as color, size, or shape.

    • Layers: Composed of geometric elements (geoms), representing what you see on the plot (points, lines, polygons, etc.) and statistical transformations (stats) that summarize the data.

    • Scales: Map values in the data space to values in an aesthetic space.

    • Coordinate Systems: Define how data coordinates are mapped to the graphic plane.

    • Faceting: Specification to separate data into subsets and display them as small multiples.

    • Theme: Controls finer aspects of the display, such as font size and background color.

ggplot for the DNase Dataset

  • Using ggplot to create a scatter plot for the DNase dataset (concentration vs. density).

library("ggplot2")  
ggplot(DNase, aes(x = conc, y = density)) + geom_point()  

ggplot for the DNase Dataset – Scatter Plot

  • Understanding the ggplot grammar:

    • ggplot: specifies the dataset and the aesthetic mapping (i.e. aes).

    • geom_point: specifies the layer to be added to the plot; geom_point can be replaced by other geometries like boxplot, histogram, or bar.

ggplot for the DNase Dataset – Boxplot

  • Creating a boxplot:

ggplot(DNase, aes(x = Run, y = density)) + geom_boxplot()  

ggplot for the DNase Dataset – Histogram

  • Creating a histogram:

ggplot(DNase, aes(density)) + geom_histogram()  

Faceting

  • To examine the relationship between concentration and density for each assay run separately, faceting divides the dataset into subsets according to Run, creating a plot for each subset.

ggplot(DNase, aes(x = conc, y = density)) + geom_point() + facet_wrap(Run ~ ., ncol=3)  

Faceting (continued)

  • Change the points to blue open circles:

ggplot(DNase, aes(x = conc, y = density)) + geom_point(shape=1, colour="blue") + facet_wrap(Run ~ .)  

Arguments for ggplot vs plot

  • Key differences in argument names:

    • colour in ggplot vs col in base plot.

    • shape in ggplot vs pch in base plot.

  • Although pch and col can also be used in ggplot, shape and colour cannot be used in base plot.

ggplot for the chickwts Dataset

  • The chickwts dataset is an in-built R dataset from an experiment measuring the effectiveness of various feed supplements on the growth rates of chickens.

  • Viewing the first six records of chickwts:

head(chickwts)  
## weight  feed  
## 1  179 horsebean  
## 2  160 horsebean  
## 3  136 horsebean  
## 4  227 horsebean  
## 5  217 horsebean  
## 6  168 horsebean  
  • Levels of the feed variable:

levels(chickwts$feed)  
## [1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"  

ggplot for the chickwts Dataset – Barplot

  • Creating a barplot for the feed variable:

ggplot(chickwts, aes(feed)) + geom_bar()  

ggplot for the chickwts Dataset – Boxplot

  • Creating a boxplot illustrating weight by feed type:

ggplot(chickwts, aes(x = feed, y = weight)) + geom_boxplot()  

ggplot for the chickwts Dataset – Boxplot with Fill Color

  • Adding colors to boxplots:

ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) + geom_boxplot()  

ggplot for the chickwts Dataset – Dotplot

  • Creating a dotplot is advisable when the number of data points is small:

ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) + geom_dotplot(binaxis='y', stackdir="center", dotsize=0.8)  
  • Note: Overlapping points can obscure data and may result in a cluttered visualization.

ggplot for the chickwts Dataset – Violin Plot

  • Generating a violin plot to visualize weight by feed type:

ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) + geom_violin()  

ggplot for the chickwts Dataset – Density Plot

  • Creating a density plot:

ggplot(chickwts, aes(x = weight, color = feed)) + geom_density()  

Finding the Optimal Dose Level

  • Considering a new drug for lowering blood pressure, different doses are applied to patients with hypertension, recording the reduction in blood pressure:

head(highbp)  
## dose  bp.drop  
## 1 0.05 -1.32581524  
## 2 0.10  3.05457330  
## 3 0.15  0.07748555  
## 4 0.20 10.86112321  
## 5 0.25  6.81803109  
## 6 0.30  3.19812646  

Finding the Optimal Dose Level – Scatter Plot

  • First, create a scatter plot:

ggplot(highbp, aes(x=dose, y=bp.drop)) + geom_point(shape=1)  

Finding the Optimal Dose Level – Geom Smooth

  • To visualize the trend in data, using geom_smooth:

ggplot(highbp, aes(x=dose, y=bp.drop)) + geom_smooth()  
  • The blue line indicates the mean and gray band indicates confidence intervals.

Finding the Optimal Dose Level – Combined Plot

  • Combine the scatter plot and smooth line in a single plot:

ggplot(highbp, aes(x=dose, y=bp.drop)) + geom_point(shape=1) + geom_smooth()  

Stepwise ggplot Approach

  • Create a stepwise plot where layers are added incrementally:

our.plot = ggplot(highbp, aes(x=dose, y=bp.drop))  
our.plot = our.plot + geom_point(shape=1)  
our.plot = our.plot + geom_smooth()  
your.plot  

Data Transformation

  • If variables x and y are already defined, a scatterplot can be generated:

ggplot(data=NULL, aes(x=x, y=y)) + geom_point()  

Data Transformation - Transparency

  • To improve visibility when dealing with numerous data points; make points transparent using alpha:

ggplot(NULL, aes(x=x, y=y)) + geom_point(alpha=0.2)  

Data Transformation - Square Root

  • Instead of plotting y versus x, plot sqrt(y) versus x:

ggplot(NULL, aes(x=x, y=sqrt(y))) + geom_point(alpha=0.2)  
  • The variance of y appears stabilized after transformation.

Genomic Data Visualization

  • A display of the ideogram of human chromosome 1:

library("ggbio")  
data("hg19IdeogramCyto", package = "biovizBase")  
plotIdeogram(hg19IdeogramCyto, subchr="chr1")  

Genomic Data: Editing Sites

  • Visualization of 500 RNA editing sites in the human genome:

library("GenomicRanges")  
data("darned_hg19_subset500", package = "biovizBase")  
autoplot(darned_hg19_subset500, layout="karyogram", aes(color=exReg, fill=exReg))  
  • This renders a karyogram layout displaying the editing sites mapped against the human chromosome data linked to exReg values.