INTRODUCTION TO R AND BASICS

1 Introduction and Preliminaries

1.1 The R Environment

  • R is an integrated suite of software for:

    • Data manipulation

    • Calculation

    • Graphical display

  • Features of R include:

    • Effective data handling and storage facilities

    • Operators for calculations on arrays, particularly matrices

    • Integrated collection of intermediate tools for data analysis

    • Graphical facilities for data analysis and display (both on computer and hard-copy)

    • Well-developed programming language called 'S', which includes:

      • Conditionals

      • Loops

      • User-defined recursive functions

      • Input and output facilities

  • The term "environment" characterizes R as a fully planned and coherent system, rather than an incremental accumulation of very specific and inflexible tools, as is often the case with other data analysis software.

1.2 Related Software and Documentation

  • R is an implementation of the S language developed at Bell Laboratories by:

    • Rick Becker

    • John Chambers

    • Allan Wilks

  • The S language also forms the basis of the S-PLUS systems.

  • The evolution of the S language is described in four books by John Chambers and coauthors, including:

    • The basic reference for R: "The New S Language: A Programming Environment for Data Analysis and Graphics" (Becker, Chambers, Wilks)

    • New features of the 1991 release of S: "Statistical Models in S" (edited by Chambers and Hastie)

    • Formal methods and classes of the methods package: "Programming with Data" (Chambers)

  • Documentation for S/S-PLUS can typically be used with R, keeping in mind the differences between the S implementations.

  • Refer to the R statistical system FAQ for existing documentation details.

1.3 R and Statistics

  • Although R was not introduced as solely a statistics system, it is widely used for statistical analysis.

  • R is best viewed as an environment for implementing many classical and modern statistical techniques.

  • A few statistical techniques are built into the base R environment, while many others are available as packages:

    • Approximately 25 standard and recommended packages come with R.

    • Further packages are available through the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org.

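  • Most classical statistics and much recent methodology is available for R, but users may need to do a little work to find the package that provides a specific technique.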

1.4 R and the Window System

  • R is most conveniently used at a graphics workstation equipped with a windowing system.

  • This guide primarily discusses interaction with the UNIX shell, but equivalent concepts apply to Windows and macOS users.

  • Setting up a workstation to take full advantage of R's customizable features is straightforward but somewhat tedious; users in difficulty should seek local expert help.

1.5 Using R Interactively

  • Upon starting R, a prompt appears, usually '>', indicating readiness for input commands.

  • Suggested first-time procedure for UNIX users:

    1. Create a sub-directory (e.g., work) to hold data files, and move into it:
      $ mkdir work
      $ cd work

    2. Start the R program with:
      $ R

    3. Enter R commands as needed.

    4. Quit R using > q(). On exit, R asks whether to save the data from the session.

  • For Windows, the method is similar, involving the creation of a folder and changing the "Start In" field in the R shortcut.

1.6 An Introductory Session

  • A preliminary exploration of R is encouraged through the sample session in Appendix A to familiarize users with the system.

1.7 Getting Help with Functions and Features

  • R includes an inbuilt help system similar to Unix's man command.

  • To access help for any specific function (e.g., solve):

    • Use > help(solve) or > ?solve

    • For a feature specified by special characters, enclose the argument in quotes:
      > help("[[")

  • help.start() launches a web browser to access help pages for better navigation.

  • To search for help on various topics:

    • Use > ??solve for help related to solve.

    • ?help.search provides more examples.

  • Examples from help topics can be executed by:

  > example(topic)
  • Windows versions offer additional help systems; > ?help can provide more information.

1.8 R Commands, Case Sensitivity, etc.

  • R has a simple syntax and is case-sensitive.

  • Naming conventions include:

    • Allowed symbols typically include all alphanumeric characters, '.', and '_'.

    • Must start with a letter or '.' (if starting with '.', the second character cannot be a digit).

  • Basic command types include:

    • Expressions: evaluated and printed, after which the value is lost.

    • Assignments: values assigned to variables without automatic display.

  • Commands can be separated by ';' or newline.

  • Comments in syntax start with a hashmark (#).

  • Incomplete commands prompt for more input with a continuation marker ('+').

  • Command lines entered at the console are limited to approximately 4095 bytes (not characters).
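
The command types and syntax rules above can be illustrated with a short sketch. The object names (total, a, A) are made up for illustration and do not come from the original text.

```r
# An expression is evaluated and printed; its value is not kept.
2 + 3

# An assignment stores the value without printing it.
total <- 2 + 3

# Two commands on one line, separated by a semicolon.
a <- 1; A <- 2

# Names are case-sensitive, so a and A are different objects.
a == A   # FALSE
```

Typing total on its own line afterwards turns the stored value back into a printed expression.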

1.9 Recall and Correction of Previous Commands

  • Command history can be scrolled forward and backward using the vertical arrow keys on both UNIX and Windows.

  • Users can edit previous commands using horizontal arrow keys and delete or add characters as needed.

  • Under UNIX, settings for recalling and editing are customizable via the readline library or Emacs using ESS (Emacs Speaks Statistics).

1.10 Executing Commands from or Diverting Output to a File

  • Commands can be stored in an external file (e.g., commands.R) and executed with:

  > source("commands.R")
  • Use sink to divert output to a file (e.g., record.lis):

  > sink("record.lis")
  • Use > sink() to restore console output.
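
As a minimal sketch of this round trip, the example below writes a small command file, runs it with source(), and diverts printed output with sink(). Temporary files stand in for commands.R and record.lis, and the file contents are assumptions made for illustration.

```r
# Write a small command file to a temporary location (hypothetical contents).
cmd_file <- tempfile(fileext = ".R")
writeLines(c("x <- c(10.4, 5.6, 3.1, 6.4, 21.7)",
             "m <- mean(x)"), cmd_file)

# Execute the stored commands in the current session.
source(cmd_file)

# Divert printed output to a file, then restore console output.
out_file <- tempfile(fileext = ".lis")
sink(out_file)
print(m)
sink()

# The diverted output is now stored in the file.
readLines(out_file)   # "[1] 9.44"
```

Calling sink() with no arguments matters: until it is called, all subsequent output keeps going to the file rather than the console.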

1.11 Data Permanency and Removing Objects

  • Objects in R represent variables, arrays, character strings, functions, etc.

  • Objects can be displayed with:

  > objects()
  • To erase objects, use:

  > rm(x, y, z, ...)
  • At the end of a session, R offers to save all current objects to a file called .RData in the current directory for use in future sessions; the command lines used in the session are saved to .Rhistory.

  • Maintaining separate working directories is recommended to avoid confusion among object names used in different analyses.
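
A small sketch of listing and removing objects follows. The three objects are hypothetical and exist only to give objects() something to show.

```r
# Create a few objects of different kinds.
x <- 1:3
y <- "some text"
z <- function(v) v + 1

# List the names of the objects currently stored (ls() is equivalent).
objects()

# Remove x and y; z is left in place.
rm(x, y)
exists("x", inherits = FALSE)   # FALSE
```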

2 Simple Manipulations; Numbers and Vectors

2.1 Vectors and Assignment

  • R utilizes named data structures, with the numeric vector being the simplest form:

    • To create a vector named x consisting of five numbers:
      > x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

    • Here, c() (concatenate function) creates a vector by joining provided values.

    • Assignment symbol '<-' directs the value to the given object.

    • In most contexts the '=' operator can be used as an alternative.

  • Assignments can also be made as follows:

  > assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
  • Assignment also works in the reverse direction, with the arrow pointing at the receiving object:

  > c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
  • A further assignment creates a new vector y with 11 entries, two copies of x with a zero in between:

  > y <- c(x, 0, x)
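
The three assignment forms can be checked against each other with a short sketch. The names x1, x2 and x3 are illustrative only.

```r
# Three equivalent ways of creating the same vector.
x1 <- c(10.4, 5.6, 3.1, 6.4, 21.7)
assign("x2", c(10.4, 5.6, 3.1, 6.4, 21.7))
c(10.4, 5.6, 3.1, 6.4, 21.7) -> x3

identical(x1, x2) && identical(x1, x3)   # TRUE

# Concatenating vectors and a constant: 5 + 1 + 5 entries.
y <- c(x1, 0, x1)
length(y)   # 11
```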

2.2 Vector Arithmetic

  • Operations on vectors are performed element-wise, allowing mismatched lengths:

    • Shorter vectors are recycled until the length matches the longest vector.

    • Constants are repeated accordingly.

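  • Example of vector arithmetic with x and y as defined above; the result v has length 11, built from 2*x recycled 2.2 times, y used once, and the constant 1 repeated 11 times: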

  > v <- 2*x + y + 1
  • Common arithmetic operations include +, -, *, /, and ^ (exponentiation).

  • Available functions include:

    • max, min, range (gives c(min(x), max(x))), length, sum, and prod for summarizing a vector.

    • For the sample mean, equivalent to sum(x)/length(x):
      > mean(x)

    • For the sample variance, equivalent to sum((x-mean(x))^2)/(length(x)-1):
      > var(x)
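
The recycling rule and the variance identity can be sketched as follows. The short vectors are chosen so that the lengths divide evenly; when the longer length is not a multiple of the shorter one (as with x of length 5 and y of length 11), R still computes the result but issues a warning.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# A constant is recycled to the length of x.
x + 1

# A shorter vector is recycled: c(1, 2) is used as 1, 2, 1, 2.
c(10, 20, 30, 40) * c(1, 2)   # 10 40 30 80

# The sample variance computed by hand agrees with var().
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_var, var(x))   # TRUE

# range() returns the minimum and maximum together.
range(x)   # 3.1 21.7
```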

2.3 Generating Regular Sequences

  • R provides various methods to generate number sequences:

    • The colon operator creates vectors quickly:
      > 1:30

    • seq() provides more general sequence generation, with arguments such as from, to, by, and length.out:
      > seq(-5, 5, by=.2)

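  • rep() replicates a vector; times repeats the whole vector end to end, while each repeats every element before moving on to the next:

  > s5 <- rep(x, times=5)
  > s6 <- rep(x, each=5)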
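
A brief sketch of these sequence tools on small inputs, so the results are easy to check by eye. The vector x here is a made-up three-element example, not the x used earlier.

```r
# The colon operator binds more tightly than arithmetic operators,
# so 2*1:3 means 2*(1:3).
2 * 1:3                      # 2 4 6

# seq() with a step size, and with a requested number of points.
s <- seq(-5, 5, by = 0.2)
length(s)                    # 51
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# rep(): whole-vector repetition versus element-wise repetition.
x <- c(1, 2, 3)
rep(x, times = 2)            # 1 2 3 1 2 3
rep(x, each = 2)             # 1 1 2 2 3 3
```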

2.4 Logical Vectors

  • Elements of a logical vector can take the values TRUE, FALSE, and NA (not available).

  • Conditions generate logical vectors, e.g.,

  > temp <- x > 13
  • Logical operators include:

    • &, |, ! for AND, OR, NOT.

  • Logical vectors convert to numeric (TRUE becomes 1, FALSE becomes 0).
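
A minimal sketch of conditions, logical operators, and the coercion of logical values to numbers, reusing the five-element x from section 2.1:

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# A condition yields a logical vector of the same length as x.
temp <- x > 13
temp                 # FALSE FALSE FALSE FALSE TRUE

# Element-wise AND and NOT.
x > 5 & x < 11       # TRUE TRUE FALSE TRUE FALSE
!temp                # TRUE TRUE TRUE TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches.
sum(x > 5)           # 4
```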

2.5 Missing Values

  • Use NA to represent missing values.

    • In general, any operation on an NA yields NA.

    • Use is.na(x) to check for NA status.

  • Distinction made with NaN (Not a Number), generated by undefined computations:

  > 0/0
  > Inf - Inf
  • Use is.nan(x) specifically for NaN values.
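
The difference between NA and NaN can be sketched on a small made-up vector z. Note that is.na() is TRUE for both kinds of value, while is.nan() is TRUE only for NaN.

```r
z <- c(1, NA, 3, 0/0)

is.na(z)     # FALSE  TRUE FALSE  TRUE
is.nan(z)    # FALSE FALSE FALSE  TRUE

# Arithmetic propagates missing values.
z + 1        # 2 NA 4 NaN

# Comparing with NA does not test for missingness; every result is NA.
z == NA

# Many summary functions can skip missing values on request.
mean(z, na.rm = TRUE)   # 2
```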

2.6 Character Vectors

  • Character vectors hold sequences of characters delimited by double quotes (single quotes are also accepted):

  > c("x-values", "New iteration results")
  • Concatenation of character strings can be done using paste():

  > labs <- paste(c("X","Y"), 1:10, sep="")
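
As a short sketch, the result of the paste() call above can be inspected directly. The shorter vector c("X", "Y") is recycled against 1:10. The paste0() line is an addition not mentioned in the original text.

```r
labs <- paste(c("X", "Y"), 1:10, sep = "")
labs
# "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"

# By default paste() separates its arguments with a single space.
paste("a", "b")      # "a b"

# paste0() is shorthand for sep = "".
paste0("X", 1:3)     # "X1" "X2" "X3"
```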

2.7 Index Vectors; Selecting and Modifying Subsets of a Data Set

  • Selecting elements of vectors can be accomplished using index vectors in square brackets:

  > y <- x[!is.na(x)]
  • Different types of index vectors:

    1. Logical vector (selects elements for TRUE).

    2. Positive integral vector (selects elements by index).

    3. Negative integral vector (excludes specified indices).

    4. Character string vector (for named components).

  • Assignments using indexed expressions modify only selected elements:

  > x[is.na(x)] <- 0  
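
The four kinds of index vector can be sketched on small made-up data. The named vector fruit is purely illustrative.

```r
x <- c(10.4, NA, 3.1, 6.4, 21.7)

x[!is.na(x)]     # logical index: 10.4 3.1 6.4 21.7
x[c(1, 3)]       # positive index: 10.4 3.1
x[-(1:2)]        # negative index drops elements 1 and 2: 3.1 6.4 21.7

# Character index: only applies when the vector has names.
fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
fruit[c("apple", "orange")]   # apple 1, orange 5

# Indexed assignment changes only the selected elements.
x[is.na(x)] <- 0
x                # 10.4 0.0 3.1 6.4 21.7
```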

2.8 Other Types of Objects

  • While vectors are fundamental, R also features:

    • Matrices/arrays: multi-dimensional structures.

    • Factors: compact handling of categorical data.

    • Lists: flexible structures of varied types.

    • Data frames: matrix-like structures incorporating different variable types.

    • Functions: treated as objects extensible within R.

3 Objects, Their Modes and Attributes

3.1 Intrinsic Attributes: Mode and Length

  • R operates on objects such as:

    • Vectors of logical, numeric, complex, character or raw types.

    • All values in a vector must have the same mode; the only apparent exception is the special value NA, which marks unavailable quantities.

  • Functions to determine mode and length: mode(object) and length(object).

  • R supports coercion between modes through functions like as.character().
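
A minimal sketch of inspecting mode and length, and of coercing between modes and back again:

```r
z <- 0:9
mode(z)                       # "numeric"
length(z)                     # 10

# Coerce numbers to character strings.
digits <- as.character(z)     # "0" "1" ... "9"
mode(digits)                  # "character"

# Coerce back; the round trip reproduces the original vector.
d <- as.integer(digits)
identical(d, z)               # TRUE
```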

3.2 Changing the Length of an Object

  • An empty object still has a mode, e.g., > e <- numeric() creates an empty numeric vector.

  • New components can extend objects via assignments:

  > e[3] <- 17
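  • An object can be truncated by assigning a subset of it back to itself; assuming alpha has length 10, this keeps only the five components with even index: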

  > alpha <- alpha[2*1:5]
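
Both directions can be sketched briefly. The starting value of alpha is an assumption, since the original text does not define it.

```r
# Extending: assigning beyond the current length pads with NA.
e <- numeric()
e[3] <- 17
e                           # NA NA 17

# Truncating: keep only the even-indexed components.
alpha <- 1:10               # assumed starting value
alpha <- alpha[2 * 1:5]
alpha                       # 2 4 6 8 10

# The length can also be set directly.
length(alpha) <- 3
alpha                       # 2 4 6
```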

3.3 Getting and Setting Attributes

  • Use attributes(object) to retrieve all defined attributes.

  • Assign or change a single attribute with attr(object, name); for example, setting "dim" lets R treat z as a 10-by-10 matrix:

  > attr(z, "dim") <- c(10,10)
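
A short sketch of the effect of setting the dim attribute. The starting value of z (the integers 1 to 100) is an assumption made for illustration.

```r
z <- 1:100                    # assumed starting value
attributes(z)                 # NULL: a plain vector has no attributes

attr(z, "dim") <- c(10, 10)
dim(z)                        # 10 10
is.matrix(z)                  # TRUE

# Elements are stored column by column, so row 2, column 3 is element 22.
z[2, 3]                       # 22
```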

3.4 The Class of an Object

  • Every object has a class identified by class(object). Possible classes include "matrix", "array", "factor", and "data.frame".

  • unclass() temporarily removes the effects of class, which is useful for inspecting an object's underlying structure.
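
A minimal sketch of class() and unclass() on small made-up objects. The exact class vector reported for a matrix differs between R versions, so the example checks it with inherits() instead.

```r
f <- factor(c("a", "b", "a"))
class(f)                      # "factor"

# Without its class, a factor is a vector of integer codes plus levels.
unclass(f)
as.integer(f)                 # 1 2 1

df <- data.frame(n = 1:2)
class(df)                     # "data.frame"

m <- matrix(1:4, nrow = 2)
inherits(m, "matrix")         # TRUE
```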

4 Ordered and Unordered Factors

4.1 A Specific Example

  • A factor is a vector object that specifies a discrete classification (grouping) of the components of other vectors of the same length. Create one with factor():

  > statef <- factor(state)
  • Levels are displayed using levels(statef).

4.2 The Function tapply() and Ragged Arrays

  • tapply() applies functions to grouped components of a vector defined by factors:

  > incmeans <- tapply(incomes, statef, mean)
  • The result is a vector of means whose components are labelled by the levels of statef.
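
The factor and tapply() steps can be sketched end to end. The state and incomes vectors below are small hypothetical data invented for this example, not the data set used in the original text.

```r
# Hypothetical data: one state label and one income per person.
state   <- c("tas", "sa", "qld", "nsw", "nsw", "sa", "qld", "tas")
incomes <- c(60, 49, 40, 61, 64, 58, 42, 54)

statef <- factor(state)
levels(statef)                # "nsw" "qld" "sa" "tas"

# Mean income within each state.
incmeans <- tapply(incomes, statef, mean)
incmeans                      # nsw 62.5, qld 41.0, sa 53.5, tas 57.0
```

The groups may have different sizes, which is why the combination of a vector and a labelling factor is called a ragged array.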

4.3 Ordered Factors

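  • Create ordered factors with ordered() when the levels have a natural ordering. Levels are stored in alphabetical order unless they are given explicitly through the levels argument: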

  > ordered_states <- ordered(state)
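
A short sketch of why the levels argument matters. The size vector and its labels are hypothetical.

```r
size <- c("medium", "low", "high", "low")

# Default: levels sorted alphabetically, which is rarely the natural order.
default_ord <- ordered(size)
levels(default_ord)           # "high" "low" "medium"

# Explicit levels impose the intended ordering low < medium < high.
sizef <- ordered(size, levels = c("low", "medium", "high"))
sizef < "high"                # TRUE TRUE FALSE TRUE
```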