INTRODUCTION TO R AND BASICS
1 Introduction and Preliminaries
1.1 The R Environment
R is an integrated suite of software for:
Data manipulation
Calculation
Graphical display
Features of R include:
Effective data handling and storage facilities
Operators for calculations on arrays, particularly matrices
Integrated collection of intermediate tools for data analysis
Graphical facilities for data analysis and display (both on screen and on hard copy)
Well-developed programming language called 'S' which includes:
Conditionals
Loops
User-defined recursive functions
Input and output facilities
The term "environment" characterizes R as a coherent system rather than an accumulation of rigid tools typical in other software.
1.2 Related Software and Documentation
R is an implementation of the S language developed at Bell Laboratories by:
Rick Becker
John Chambers
Allan Wilks
The S language also forms the basis of the S-PLUS systems.
The evolution of the S language is detailed in four books by John Chambers and colleagues. The ones most relevant to R are:
Basic reference for R: "The New S Language: A Programming Environment for Data Analysis and Graphics" (Becker, Chambers, Wilks)
New features of the 1991 release of S are covered in "Statistical Models in S" (edited by Chambers and Hastie)
The formal methods and classes of the methods package are based on those described in "Programming with Data" (Chambers)
Documentation for S/S-PLUS can typically be used with R, provided the differences between the S implementations are kept in mind.
Refer to the R statistical system FAQ for existing documentation details.
1.3 R and Statistics
Although R was not introduced as solely a statistics system, it is widely used for statistical analysis.
R is best viewed as an environment for implementing many classical and modern statistical techniques.
A few statistical techniques are built into the base R environment, while many others are available as packages:
Approximately 25 standard and recommended packages come with R.
Further packages are available through the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org.
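Most classical statistics and much recent methodology is available for R, but users may need to do a little searching to find the package that provides a specific technique.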
1.4 R and the Window System
R is most conveniently used at a graphics workstation equipped with a windowing system.
This guide primarily discusses interaction with the UNIX shell, but equivalent concepts apply to Windows and macOS users.
Setting up a workstation to take full advantage of R's customizable features is straightforward, if somewhat tedious; users in difficulty should seek local expert help.
1.5 Using R Interactively
Upon starting R, a prompt appears, usually '>', indicating readiness for input commands.
Suggested first-time procedure for UNIX users:
Create a sub-directory (e.g., work) to hold data files, and change into it:
$ mkdir work
$ cd work
Start the R program with:
$ R
Enter R commands as needed.
Quit R using:
> q()
On exit, R asks whether to save the data from the session.
For Windows, the method is similar, involving the creation of a folder and changing the "Start In" field in the R shortcut.
1.6 An Introductory Session
A preliminary exploration of R is encouraged through the sample session in Appendix A to familiarize users with the system.
1.7 Getting Help with Functions and Features
R includes an inbuilt help system similar to Unix's man command.
To access help for any specific function (e.g., solve), use:
> help(solve)
or
> ?solve
For non-standard function names, enclose the name in quotes:
> help("[[")
help.start() launches a web browser to access help pages for better navigation.
To search for help on various topics, use:
> ??solve
for help related to solve. ?help.search provides more examples.
Examples from help topics can be executed by:
> example(topic)
Windows versions offer additional help systems; use
> ?help
for more information.
1.8 R Commands, Case Sensitivity, etc.
R has a simple syntax and is case-sensitive.
Naming conventions include:
Allowed symbols typically include all alphanumeric characters, '.', and '_'.
Must start with a letter or '.' (if starting with '.', the second character cannot be a digit).
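As an illustrative check of these naming rules, the sketch below relies on make.names(), which returns a name unchanged only when it is already syntactically valid. The candidate names are made up for this example.

```r
# Hypothetical candidate names, chosen to exercise the rules above
candidates <- c("x", ".x", "my_var", ".2x", "2x")

# make.names() rewrites invalid names, so an unchanged name is a valid one
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
print(is_valid)
```

The last two candidates fail because one starts with '.' followed by a digit and the other starts with a digit.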
Basic command types include:
Expressions: evaluated and printed; the value is then lost.
Assignments: evaluated and the value passed to a variable, without being printed automatically.
Commands can be separated by ';' or newline.
Comments in syntax start with a hashmark (#).
Incomplete commands prompt for more input with a continuation marker ('+').
Command lines entered at the console are limited to about 4095 bytes (not characters).
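A minimal sketch of the two command types, the ';' separator, and comments. The variable names width, height, and area are illustrative only.

```r
# Two assignments on one line, separated by ';' (nothing is printed)
width <- 4; height <- 2.5

# An expression: evaluated and printed, but the value is not stored
width * height

# An assignment: the value is stored in 'area' and not printed automatically
area <- width * height
```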
1.9 Recall and Correction of Previous Commands
Command history can be recalled using arrow keys on both UNIX and Windows.
Users can edit previous commands using horizontal arrow keys and delete or add characters as needed.
Under UNIX, settings for recalling and editing are customizable via the readline library or Emacs using ESS (Emacs Speaks Statistics).
1.10 Executing Commands from or Diverting Output to a File
Commands can be stored in an external file (e.g., commands.R) and executed with:
> source("commands.R")
Use sink to divert output to a file (e.g., record.lis):
> sink("record.lis")
To restore console output, use:
> sink()
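The round trip can be tried without touching the working directory. The sketch below uses temporary files as stand-ins for commands.R and record.lis; the one-line script content is an assumption made for illustration.

```r
# Write a one-line script to a temporary file (stand-in for commands.R)
script_file <- tempfile(fileext = ".R")
writeLines("total <- sum(1:10)", script_file)

# Run the commands stored in the file
source(script_file)

# Divert printed output to a temporary file (stand-in for record.lis)
record_file <- tempfile(fileext = ".lis")
sink(record_file)
print(total)
sink()  # restore output to the console

# Inspect what was captured while output was diverted
captured <- readLines(record_file)
print(captured)
```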
1.11 Data Permanency and Removing Objects
Objects in R represent variables, arrays, character strings, functions, etc.
Objects can be displayed with:
> objects()
To erase objects, use:
> rm(x, y, z, ...)
At the end of a session, objects can be saved permanently in the file .RData for use in future sessions, while the command history is saved in .Rhistory.
Maintaining separate working directories is recommended to avoid confusion among object names used in different analyses.
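A small sketch of listing and removing objects. The object names are hypothetical and chosen so they are unlikely to clash with anything already in a workspace.

```r
# Create two throwaway objects
alpha_demo <- 1:3
beta_demo <- "text"

# objects() lists the names currently defined
both_present <- all(c("alpha_demo", "beta_demo") %in% objects())

# Remove one of them and confirm it is gone
rm(alpha_demo)
still_there <- exists("alpha_demo")
print(c(both_present = both_present, still_there = still_there))
```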
2 Simple Manipulations; Numbers and Vectors
2.1 Vectors and Assignment
R utilizes named data structures, with the numeric vector being the simplest form:
To create a vector named x consisting of numbers:
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
Here, c() (the concatenate function) creates a vector by joining the provided values.
The assignment symbol '<-' directs the value to the given object.
In most contexts, the '=' operator can be used as an alternative.
Assignments can also be made as follows:
> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
Assignments can also be made in the other direction, using '->':
> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
Further assignments, like:
> y <- c(x, 0, x)
create a new vector y with 11 entries: two copies of x with a zero in between.
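The three assignment forms can be compared directly. In this sketch, x2 and x3 are hypothetical names used only to keep the three results apart.

```r
# Three equivalent ways of creating the same vector
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
assign("x2", c(10.4, 5.6, 3.1, 6.4, 21.7))
c(10.4, 5.6, 3.1, 6.4, 21.7) -> x3

# All three hold identical values
same <- identical(x, x2) && identical(x, x3)
print(same)

# Two copies of x with a zero in between
y <- c(x, 0, x)
print(length(y))
```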
2.2 Vector Arithmetic
Operations on vectors are performed element-wise, allowing mismatched lengths:
Shorter vectors are recycled until the length matches the longest vector.
Constants are repeated accordingly.
Example of vector arithmetic:
> v <- 2*x + y + 1
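A short sketch of recycling, assuming x and y as defined in section 2.1. Because the length of y (11) is not a multiple of the length of x (5), R issues a warning for this expression; the sketch suppresses it and adds a second, hypothetical pair of vectors whose lengths divide evenly.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)                      # length 11

# 2*x (length 5) is recycled against y; the constant 1 is repeated 11 times
v <- suppressWarnings(2*x + y + 1)
print(length(v))

# Lengths that divide evenly recycle without a warning
short <- c(1, 2)
long <- c(10, 20, 30, 40)
recycled <- short + long             # short is used twice
print(recycled)
```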
Common arithmetic operations include +, -, *, /, and ^ (exponentiation).
Available functions include max, min, range (gives c(min(x), max(x))), length, sum, and prod for vector metrics.
For sample mean:
> mean(x)
For sample variance:
> var(x)
which gives the same value as:
> sum((x-mean(x))^2)/(length(x)-1)
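The variance identity can be checked numerically. This sketch reuses the example vector x from section 2.1; manual_var is a hypothetical name for the hand-computed value.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Sample variance computed from its definition
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# Compare with the built-in function, allowing for rounding error
print(all.equal(var(x), manual_var))

# range() returns the minimum and maximum as a vector of length 2
print(range(x))
```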
2.3 Generating Regular Sequences
R provides various methods to generate number sequences:
A colon operator can create vectors quickly:
> 1:30
Use seq() for generalized sequence generation, with arguments for start, end, length, and more:
> seq(-5, 5, by=.2)
Example for replicating vectors:
> s5 <- rep(x, times=5)
> s6 <- rep(x, each=5)
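The sketch below puts the sequence tools side by side. The names n, s3, and s4 are illustrative; s5 and s6 follow the text.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

n <- 1:30                                        # integers 1 to 30
s3 <- seq(-5, 5, by = 0.2)                       # from -5 to 5 in steps of 0.2
s4 <- seq(from = -5, by = 0.2, length.out = 51)  # same sequence, given by length

s5 <- rep(x, times = 5)   # whole vector repeated end to end, five times
s6 <- rep(x, each = 5)    # each element repeated five times before moving on

print(length(s3))
print(s6[1:6])
```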
2.4 Logical Vectors
R allows logical data types: TRUE, FALSE, and NA (not available).
Conditions generate logical vectors, e.g.,
> temp <- x > 13
Logical operators include &, |, and ! for AND, OR, and NOT.
Logical vectors convert to numeric (TRUE becomes 1, FALSE becomes 0).
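A brief sketch of logical vectors, using the example vector x from section 2.1. The name in_range is hypothetical.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# A condition yields a logical vector of the same length as x
temp <- x > 13

# Conditions can be combined element-wise with & and |
in_range <- x > 5 & x < 11

# In arithmetic, TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
n_large <- sum(temp)
print(temp)
print(n_large)
```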
2.5 Missing Values
Use NA to represent missing values.
In general, any operation on an NA results in NA.
Use is.na(x) to check for missing values; it returns TRUE for both NA and NaN.
Distinction made with NaN (Not a Number), generated by undefined computations:
> 0/0
> Inf - Inf
Use is.nan(x) to test specifically for NaN values.
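The difference between the two tests shows up on a vector that holds both kinds of value. The vector z here is a made-up example.

```r
# One ordinary value, one missing value, one undefined result
z <- c(1, NA, 0/0)

na_flags <- is.na(z)    # TRUE for both NA and NaN
nan_flags <- is.nan(z)  # TRUE only for NaN

print(na_flags)
print(nan_flags)

# Arithmetic leaves the missing entries missing
print(z + 1)
```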
2.6 Character Vectors
Character vectors denote sequences of characters enclosed in double quotes:
> c("x-values", "New iteration results")
Concatenation of character strings can be done using paste():
> labs <- paste(c("X","Y"), 1:10, sep="")
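Recycling applies to paste() as well: the shorter vector c("X","Y") is repeated five times to match 1:10. A quick sketch:

```r
# c("X", "Y") is recycled against 1:10; sep = "" joins without a space
labs <- paste(c("X", "Y"), 1:10, sep = "")
print(labs)
```

The result alternates between the two letters: "X1", "Y2", "X3", and so on up to "Y10".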
2.7 Index Vectors; Selecting and Modifying Subsets of a Data Set
Selecting elements of vectors can be accomplished using index vectors in square brackets:
> y <- x[!is.na(x)]
Different types of index vectors:
Logical vector (selects elements for TRUE).
Positive integral vector (selects elements by index).
Negative integral vector (excludes specified indices).
Character string vector (for named components).
Assignments using indexed expressions modify only selected elements:
> x[is.na(x)] <- 0
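The four kinds of index vector can be compared on one small named vector. The data below (fruit and scores) are invented for illustration.

```r
# A named numeric vector
fruit <- c(orange = 5, banana = 10, apple = 1, peach = 20)

by_logical <- fruit[fruit > 4]             # keep elements where the condition is TRUE
by_position <- fruit[c(1, 3)]              # keep the 1st and 3rd elements
by_exclusion <- fruit[-(1:2)]              # drop the first two elements
by_name <- fruit[c("apple", "orange")]     # select by component name

# Indexed assignment changes only the selected elements
scores <- c(3, NA, 7)
scores[is.na(scores)] <- 0

print(by_logical)
print(scores)
```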
2.8 Other Types of Objects
While vectors are fundamental, R also features:
Matrices/arrays: multi-dimensional structures.
Factors: compact handling of categorical data.
Lists: flexible structures of varied types.
Data frames: matrix-like structures incorporating different variable types.
Functions: themselves objects in R, which can be stored in the workspace and provide a simple way to extend R.
3 Objects, Their Modes and Attributes
3.1 Intrinsic Attributes: Mode and Length
R operates on objects such as:
Vectors of logical, numeric, complex, character or raw types.
Vectors must be of uniform mode; the only apparent exception is NA, the special value for missing data, which can appear in a vector of any mode.
Functions to determine mode and length: mode(object) and length(object).
R supports coercion between modes through functions like as.character().
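A minimal sketch of inspecting mode and length, then coercing to character and back. The names z, digits, and d are illustrative.

```r
z <- 0:9
print(mode(z))      # "numeric"
print(length(z))    # 10

# Coerce to character: each number becomes a string
digits <- as.character(z)
print(mode(digits)) # "character"

# Coerce back to integer, recovering the original values
d <- as.integer(digits)
print(identical(d, z))
```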
3.2 Changing the Length of an Object
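An empty object can still have a mode, e.g.,
> e <- numeric()
New components can extend objects via assignments:
> e[3] <- 17
An object can be truncated by assigning a subset of itself back to it, e.g., to keep only the even-indexed components of a vector alpha of length 10:
> alpha <- alpha[2*1:5]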
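The sketch below follows both directions of a length change. It assumes alpha starts as the integers 1 to 10, which the text does not specify.

```r
# Start with an empty numeric vector and extend it by assignment
e <- numeric()
e[3] <- 17          # positions 1 and 2 are filled with NA
print(e)

# Assumed starting value for alpha: the integers 1 to 10
alpha <- 1:10
alpha <- alpha[2 * 1:5]   # keep the components at positions 2, 4, 6, 8, 10
print(alpha)

# Assigning to length() also truncates
length(alpha) <- 3
print(alpha)
```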
3.3 Getting and Setting Attributes
Use attributes(object) to retrieve all defined attributes.
Assign new attributes with:
> attr(z, "dim") <- c(10,10)
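Setting the dim attribute is enough for R to treat a plain vector as a matrix. The sketch assumes z holds 100 values, which the text implies but does not state.

```r
# Assumed starting value: a plain vector of 100 integers
z <- 1:100
print(attributes(z))          # NULL: no attributes defined yet

# Give z a dim attribute so R treats it as a 10-by-10 matrix
attr(z, "dim") <- c(10, 10)

print(is.matrix(z))
print(attributes(z))
```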
3.4 The Class of an Object
Every object has a class, reported by class(object). Possible classes include "matrix", "array", "factor", and "data.frame".
Using unclass() temporarily removes class attributes for inspection.
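As a small illustration, a factor shows what unclass() reveals: integer codes plus a levels attribute. The factor f is a made-up example.

```r
f <- factor(c("a", "b", "a"))
print(class(f))       # "factor"

# unclass() returns a copy without the class, exposing the underlying codes
codes <- unclass(f)
print(codes)
print(class(codes))   # "integer"
```

The original object f is unchanged; only the unclassed copy loses its factor behaviour.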
4 Ordered and Unordered Factors
4.1 A Specific Example
Factors facilitate discrete classifications of vector components using:
> statef <- factor(state)
Levels are displayed using levels(statef).
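A minimal sketch with a short, made-up state vector, since the text does not list the data.

```r
# Hypothetical sample of state codes
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa")

statef <- factor(state)

# Levels are the distinct values, sorted alphabetically
print(levels(statef))

# table() counts how many components fall in each level
print(table(statef))
```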
4.2 The Function tapply() and Ragged Arrays
tapply() applies a function to groups of components of a vector, where the groups are defined by a factor:
> incmeans <- tapply(incomes, statef, mean)
The result is a vector of means with components labelled by the levels of the factor.
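The grouping can be traced on a few invented values; the state and incomes data below are assumptions for illustration only.

```r
# Hypothetical data: one income per person, with that person's state
state <- c("nsw", "vic", "nsw", "qld", "vic")
incomes <- c(60, 40, 70, 50, 44)

statef <- factor(state)

# Mean income within each level of statef
incmeans <- tapply(incomes, statef, mean)
print(incmeans)
```

Here the two "nsw" incomes average to 65, the single "qld" income is 50, and the two "vic" incomes average to 42.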
4.3 Ordered Factors
Create ordered factors using ordered() to impose a natural ordering on the levels:
> ordered_states <- ordered(state)
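With no levels argument, ordered() uses the sorted distinct values as the ordering. When the natural order is not alphabetical, the levels can be given explicitly, as in this sketch with made-up size categories.

```r
# Hypothetical categories whose natural order is not alphabetical
sizes <- c("low", "high", "medium", "low")

# Specify the ordering explicitly, from smallest to largest
sizef <- ordered(sizes, levels = c("low", "medium", "high"))
print(sizef)

# Ordered factors support comparisons between levels
is_smaller <- sizef[1] < sizef[2]   # "low" compared with "high"
print(is_smaller)
```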