Intro to Data Analysis & Programming - DAY 1
What is R and how the course uses it
R is a programming language built on S (the speaker notes this origin).
The course treats R as a tool for data analysis: it first covers the syntax and commands, then applies them to real data.
Use-case framing: data-driven decisions in agriculture and business rely on datasets that show how outcomes change under different conditions (e.g., altering fertilizer and measuring crop yield).
An agricultural example is used to demonstrate evidence-based decision making:
Two data feeds: an old feed and a new feed under the same conditions, to compare the impact of the fertilizer.
The goal is to quantify changes using data and datasets to support conclusions.
Open-ended emphasis: data analysis is about evidence, not just a single calculation; it relies on datasets and careful comparisons.
Open source means:
Free to use: no purchase required.
R packages and the ecosystem
Packages extend R’s core capabilities by adding new functions and workflows.
Core vs contributed packages:
Core packages provide fundamental capabilities.
Contributed packages add specialized functionality.
Package ecosystem is built around repositories, most notably CRAN (Comprehensive R Archive Network).
Packages are used across many domains: finance, econometrics, optimization, machine learning, etc.
How to choose packages: you typically rely on domain-specific packages to avoid reinventing the wheel.
Installing and loading packages in R
Key commands (all lowercase by convention):
install.packages("pkgName") – download and install a package from CRAN or another repository.
library(pkgName) – load a package into the current R session so you can use its functions.
Important distinctions:
Installing a package makes it available on your system; loading a package makes its functions available in your session.
install.packages() needs the package name in quotes; library() accepts it with or without quotes, so both library(pkgName) and library("pkgName") appear in examples.
Conceptual analogy used: installing a package is like receiving a boxed set of kitchen tools; loading a package is taking those tools out to use.
Practical note: packages are organized in a directory on your computer; you can manage them via the IDE’s Package Manager UI or via commands.
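The install-versus-load distinction can be sketched as follows. This is a minimal sketch that uses MASS only as a stand-in package name (it normally ships with R, so the install step is skipped); the requireNamespace() guard is an addition for illustration and was not part of the lecture.

```r
# Install once per machine: only needed if the package is not already present.
# MASS is used here purely as an example package name.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")   # the name must be a quoted string
}

# Load once per session: attaches the package so its functions are available.
library(MASS)                # quotes are optional here

# Confirm that the package is attached to the current session.
loaded <- "package:MASS" %in% search()
print(loaded)
```

Installing is the "boxed set arrives" step and happens rarely; library() is the "take the tools out" step and is repeated in every new session.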
Getting started with RStudio (the IDE)
RStudio is introduced as an integrated development environment (IDE) that combines editing, execution, and visualization in one place.
Typical layout (as demonstrated): three main windows, plus an editor that opens when you create a script
Console: where code is executed and results appear.
Environment: shows objects currently in memory (variables, data frames, etc.).
Graphics/Plots: where plots appear.
The instructor emphasizes that an IDE helps avoid juggling multiple tools (editor, console, plots) by integrating them in one workspace.
Working with the editor, scripts, and files
The instructor demonstrates creating an editor window in RStudio:
File > New File > R Script to open a script editor window.
The script editor is where you write and save commands for later execution.
File and folder preparation for organization:
Create a dedicated folder on the desktop (named something like 4321) to store scripts and class notes.
Inside this folder, scripts and notes can be saved with clear naming conventions.
File naming and extension conventions:
R scripts use the extension .R (e.g., class_notes_lecture1.R).
Excel uses .xlsx, Word uses .docx (the extension tells you the file type).
The instructor demonstrates the relationship between the file name and status indicators in the IDE (color cues):
A new file (untitled) starts with a red color, indicating it has not yet been saved.
After saving, the file name changes color: red means the file is new or has unsaved changes, and the name is no longer red once the file is saved and has a designated path. Blue indicates executable content in the editor.
The console shows outputs and results, each line prefixed by an index in brackets (e.g., [1]), while the editor holds the code.
Summary of key terms and concepts
R: a programming language for data analysis, historically rooted in the S language, used to manipulate data and perform analyses.
CRAN: Comprehensive R Archive Network, the central repository for R packages.
Packages: collections of functions and data that extend R’s capabilities; examples include domain-specific tools for finance, econometrics, optimization, and machine learning.
IDE: Integrated Development Environment; RStudio is the example used, combining editing, execution, and visualization.
Working directory: the folder R reads from and writes to by default, and where your current project files are stored; essential for organizing scripts and data.
Script vs. console:
Script (.R) stores reproducible code; console shows immediate results.
Use the script editor to write and save code, execute via keyboard shortcuts to view results in the console.
Comments: lines starting with # are used to document code; they do not execute.
Basic file-naming discipline: use descriptive names, include date when helpful, and maintain a consistent convention for easy retrieval.
Basic data-analysis notation (illustrative):
Difference in means: $\Delta = \bar{x}_{\text{new}} - \bar{x}_{\text{old}}$
Simple sample mean (illustrative): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
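As a rough sketch of these two quantities, the snippet below computes sample means and their difference for the old-versus-new comparison described earlier. The yield numbers and the names old_yield and new_yield are invented for illustration only.

```r
# Hypothetical crop yields under the same conditions (made-up numbers).
old_yield <- c(10, 12, 11, 13)
new_yield <- c(14, 15, 13, 16)

# Sample mean: sum of the observations divided by their count.
mean_old <- sum(old_yield) / length(old_yield)
mean_new <- mean(new_yield)   # built-in equivalent of the line above

# Difference in means: new minus old.
diff_means <- mean_new - mean_old
print(diff_means)
```

A single difference like this is only the starting point; the evidence-based framing above still calls for comparing groups under the same conditions.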
Homework and follow-up guidance from the lecture
Read the material thoroughly before attempting exercises.
Create a dedicated folder (e.g., on the Desktop) named something like 4321 and save class notes and scripts there.
In your R script, begin with two lines of comments:
Your name (header)
A line noting the class and date (e.g., Class notes August 26, 2025)
Use the # symbol to mark comments and clearly separate notes from code.
Name your class notes file clearly (e.g., class_notes_lecture1_2025-08-26.R).
Practice saving and running code from the editor using Ctrl+Enter (Windows) or Cmd+Return (macOS).
Expect color cues in the IDE to indicate status: red means unsaved/new file, blue indicates executable code, and the saved state changes colors accordingly.
Ensure you understand the distinction between installing packages (install.packages(...)) and loading them (library(...)).
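A class-notes script following this guidance might start like the sketch below. "Your Name" is a placeholder, and the small calculation is an invented example added only so there is something to run.

```r
# Your Name
# Class notes August 26, 2025

# Lines starting with # are comments and are not executed.
# Place the cursor on a line below and press Ctrl+Enter (Windows)
# or Cmd+Return (macOS) to run it; the result appears in the console.

total <- 2 + 3   # a simple calculation to try running
print(total)     # console shows: [1] 5
```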
Dynamic vs. Static Typing
R is dynamically typed, meaning the type of a value is checked when the script runs, not beforehand.
In contrast, languages like C, C++, Java, Pascal, and Fortran are statically typed: types are checked at compile time, before the program runs.
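The run-time nature of R's type checking can be seen in a small sketch. The function name add_one is hypothetical, and tryCatch() is used here only to capture the error so the script keeps running.

```r
# A function that assumes its input is a number.
add_one <- function(x) {
  x + 1
}

# Defining the function raises no error; types are checked only when it runs.
ok <- add_one(2)              # numeric input works

# Character input fails at run time; tryCatch() captures the error message.
msg <- tryCatch(
  add_one("two"),
  error = function(e) conditionMessage(e)
)

print(ok)
print(msg)
```

In a statically typed language, the equivalent mismatch would typically be rejected before the program ever ran.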
Basic Data Types
The values variables contain can be numbers, characters (text), logical values, etc.
The basic data types in R are:
numeric: real numbers, including decimals
integer: whole numbers (the L after the number tells R to store it as an integer)
complex: complex numbers with real and imaginary parts
logical: e.g., TRUE, FALSE
character: e.g., "Hello", "b"
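One way to see the five basic types is to create a value of each and ask R for its class. The specific values and variable names below are arbitrary choices for illustration.

```r
num  <- 3.14        # numeric (stored as double precision)
int  <- 5L          # integer: the L suffix requests integer storage
cplx <- 2 + 3i      # complex: real part 2, imaginary part 3
lgl  <- TRUE        # logical
chr  <- "Hello"     # character

# class() reports the type of each object.
types <- c(class(num), class(int), class(cplx), class(lgl), class(chr))
print(types)
```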
Objects and Variables
R does not declare variables as data types in the way other programming languages like C or Java do.
Everything in R is an object. Variables are R-objects.
Thus, the data type of the R-object becomes the data type of the variable.
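A short sketch of this idea: the same variable name takes on whatever type the object assigned to it has. The names and values are arbitrary.

```r
x <- 42L                  # x refers to an integer object
first_class <- class(x)

x <- "forty-two"          # the same name now refers to a character object
second_class <- class(x)

# Functions are objects too.
fn_class <- class(mean)

print(c(first_class, second_class, fn_class))
```

No declaration such as "int x" is needed; the assignment alone determines the type.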
Data Structures
R operates on named data structures; a data structure is a specific way of organizing and storing data.
There are five basic data structures in R:
Vectors
Matrices
Lists
Data Frames
Factors
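The five structures can each be created in one line. This is a minimal sketch with made-up values; the variable names are arbitrary.

```r
v  <- c(1.5, 2.5, 3.5)                      # vector: elements of one type
m  <- matrix(1:6, nrow = 2, ncol = 3)       # matrix: 2 rows, 3 columns
l  <- list(name = "plot A", yield = 12.5)   # list: elements may differ in type
df <- data.frame(id = 1:3, yield = v)       # data frame: equal-length columns
f  <- factor(c("old", "new", "old"))        # factor: categorical values

str(df)             # compact summary of the data frame's structure
print(levels(f))    # the distinct categories of the factor
```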
Built-in Data Sets
Some R-packages come with built-in datasets. There are a number of these included in R that will be used throughout the semester to illustrate R’s functionality.
Built-in datasets in R can be accessed using the command: data("datasetname").
For example: data("Orange").
You can get more details about each built-in dataset using the command: ?datasetname.
Note: Many datasets are within packages. You must load the package containing the dataset using library() before you can access the dataset.
For example: library(MASS) then data("Insurance").
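Put together, the commands above might be used as in this sketch. It assumes the MASS package is installed, which is normally the case because it ships with standard R distributions.

```r
data("Orange")            # Orange is available without loading extra packages
head(Orange)              # show the first few rows
orange_dims <- dim(Orange)

# ?Orange                 # opens the help page (for interactive use)

library(MASS)             # Insurance lives in the MASS package
data("Insurance")
insurance_ok <- is.data.frame(Insurance)

print(orange_dims)
```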
Importing Data in R
R can read data from files stored outside the R environment.
Data is available in various formats such as .txt, .csv, .xls, .xlsx, .xml, .mtp, etc.
For example, to read a text file: uc <- read.table("LN6Usedcars.txt", header = TRUE).
Important: For R to find a file referenced by name alone, the file must be in the current working directory; otherwise, give the full path.
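The following sketch shows the read step end to end. Because the lecture's LN6Usedcars.txt is not available here, it first writes a small stand-in file to a temporary folder; the file name example_cars.txt and its contents are hypothetical.

```r
# Create a small text file so the example is self-contained.
# The file name and contents are made up, not the lecture's data.
cars <- data.frame(price = c(5000, 7200, 6100), age = c(8, 4, 6))
path <- file.path(tempdir(), "example_cars.txt")
write.table(cars, path, row.names = FALSE, quote = FALSE)

# getwd() reports where R looks for files referenced by name only.
print(getwd())

# Read the file back; header = TRUE treats the first line as column names.
# A full path is used here because the file is not in the working directory.
uc <- read.table(path, header = TRUE)
print(uc)
```

Writing header = TRUE in full is a little safer than header = T, since T is an ordinary variable that can be reassigned.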
Simulating Random Data
R can generate random numbers from many distributions, from familiar ones to specialized ones.
To use them, you only need to know the parameters each function takes, such as a mean or a rate.
Functions with the r prefix generate random data from various distributions.
For example, to generate 10 random draws from a normal distribution with mean 5 and standard deviation 0.5: rnorm(10, mean = 5, sd = 0.5).
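A brief sketch of the r-prefix family follows. The call to set.seed() and the seed value 123 are additions for reproducibility and were not part of the lecture; runif() and rexp() are shown as two further examples of the same naming pattern.

```r
set.seed(123)   # arbitrary seed so the same numbers come out on every run

norm_draws <- rnorm(10, mean = 5, sd = 0.5)   # normal: mean and sd
unif_draws <- runif(10, min = 0, max = 1)     # uniform: min and max
exp_draws  <- rexp(10, rate = 2)              # exponential: rate

print(round(norm_draws, 2))
```

Each function takes the sample size first, followed by the parameters of its distribution.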