Descriptive: Summarizes characteristics of a data set without interpretation.
Example: "How many people live in each province and territory in Canada?"
Exploratory: Looks for patterns, trends, or relationships in a single data set.
Example: "Does political party voting patterns change with indicators of wealth in a set of data collected on 2,000 people living in Canada?"
Predictive: Focuses on predicting outcomes based on existing data.
Example: "What political party will someone vote for in the next Canadian election?"
Inferential: Examines if findings from a single data set can apply to the wider population.
Example: "Does political party voting change with indicators of wealth for all people living in Canada?"
Summarization: Computes and reports aggregated values.
Example Question: "What is the average race time for runners in this data set?"
Visualization: Graphically plots data for interpretation.
Example Question: "Is there any relationship between race time and age for runners in this data set?"
Classification: Predicts a class/category for new observations.
Example Question: "Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?"
Regression: Predicts a quantitative value for new observations.
Example Question: "What will be the race time for a 20-year-old runner who weighs 50kg?"
Clustering: Finds unknown/unlabelled subgroups in a dataset.
Example Question: "What products are commonly bought together on Amazon?"
Estimation: Using measurements or averages from a smaller group (a sample) to estimate a value for a larger population.
Example Question: "Given a survey of cellphone ownership of 100 Canadians, what proportion of the entire Canadian population own Android phones?"
Data Science: use of reproducible + auditable processes to get value from data
Reproducible: easily repeated by others
Auditable: easily traced & critiqued
Data Set: Structured collection of numbers and characters.
Rows: Represent observations (horizontal).
Columns: Represent variables (vertical).
Data Frame: a data set organized so that each row is a single observation and each column is a variable whose values are all of one type.
Loading the tidyverse in R can produce messages about attached packages and potential conflicts.
Users can access a specific version of a function from a loaded package by using the package prefix (e.g., dplyr::filter()).
Use lowercase letters, numbers, and underscore (_).
Avoid using spaces or special characters.
R is case-sensitive.
Use meaningful variable names for better readability of scripts.
The filter() function: obtain a subset of ROWS with specific values
Example: filter(data_frame, logical_statement) returns a new data frame containing only the rows where the logical statement evaluates to TRUE.
The select() function extracts specific COLUMNS from the data frame.
The mutate() function adds/modifies columns in a data frame, allowing for new calculations or transformations of existing data (see the sketch below).
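A minimal sketch of these three verbs, assuming a hypothetical data frame race_data with columns name, age, and time_minutes (these names and values are invented for illustration):
library(tidyverse)
race_data <- tibble(name = c("Ana", "Ben", "Cleo"),
                    age = c(20, 35, 28),
                    time_minutes = c(42, 55, 47))
fast_runners <- filter(race_data, time_minutes < 50)            # keep ROWS where the condition is TRUE
times_only <- select(race_data, name, time_minutes)             # keep only these COLUMNS
race_data <- mutate(race_data, time_hours = time_minutes / 60)  # add a new column computed from an existing one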
ggplot() is a function used for creating visualizations in R, with layering concepts similar to how one builds a plot manually.
Advantages of using R:
statistical analysis functions go beyond Excel
free & open-source
transparent & reproducible code
can handle large amounts of data + complex analyses
Correlation does not equal causation: Even if a correlation existed, it wouldn't necessarily mean that one variable causes the other. There could be other factors influencing both variables.
More data and analysis: To make a more definitive conclusion, it would be helpful to have more data points and conduct statistical tests to quantify the strength and significance of any potential correlation
Path: where the file lives on computer, the directions to the file
Local (computer):
Absolute path: locates file with respect to the “root” folder on computer
full path from the root folder to the file's location on the filesystem
STARTS with /
e.g. /home/instructor/documents/timesheet.xlsx
Relative path: locates a file relative to your working directory
DOESN’T start with /
e.g. documents/timesheet.xlsx
Working directory: current folder/location containing the file we are currently working on, serving as reference point for relative paths
Format DOESN’T include the working directory
Used to navigate files efficiently & to ensure code can be run on a different computer
Remote (on web):
Uniform Resource Locator (URL):A specific type of address used to access resources on the internet, which includes the protocol, domain name, and path to the resource.
http:// or https://
assign it to object named url to use it:
url <- "https://raw.githubusercontent.com/UBC_DCSI/data/main/can_lang.csv"
canlang_data <- read_csv("url")
canlang_data
Ex. if absolute path looks like: /Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx
and working directory is UBC
, relative path: BIOL363/SciaticNerveLab/sn_trial_1.xlsx
Navigate file, see what it looks like: column names? delimiters? lines to skip?
Prepare to load into R
skim data, might need to modify, look at how values are separated
might need to load a library
, download file, or connect to database
Load into R: check with reading data function
Inspect the result CAREFULLY to make sure it worked: to reduce bugs + speed up
read_csv(): To read CSV files (commas as the delimiter)
read_tsv(): To read tab-separated files (tabs between the columns)
read_delim(): For flexibility in choosing different delimiters, can import both CSV and TSV files, but just SPECIFY WHICH DELIMITER during import
1. specify path to file as first argument
2. provide the tab character "\t" as the delim argument
3. set the col_names argument to FALSE to say there are NO COLUMN NAMES GIVEN in the data; if TRUE, the first row will be treated as column names
e.g. canlang_data <- read_delim("data/can_lang_no_names.tsv", delim = "\t", col_names = FALSE)
read_excel(): For reading Excel spreadsheets, load the readxl package first
sheet argument to specify sheet # or name (when the file has multiple sheets)
range argument to specify cell ranges (see the sketch after this list)
skip: to ignore non-crucial information at the top of the data files when loading, so data can be read properly
1. Look at data first to see HOW MANY LINES WE DON’T NEED (e.g. if columns start at line 4, skip the first 3 lines)
canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)
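A hedged sketch of read_excel() with the sheet and range arguments; the file name, sheet name, and cell range below are made up for illustration:
library(readxl)
# read only the sheet named "2016" and only the rectangle of cells A1:D10 (hypothetical file/sheet/range)
canlang_data <- read_excel("data/can_lang.xlsx", sheet = "2016", range = "A1:D10")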
Connect to databases using dbConnect() and run SQL queries through R without needing extensive SQL knowledge, thanks to packages like dbplyr.
Use collect() to bring data from a database into R as a data frame for local analysis.
We DIDN’T get a data frame from the database, but it’s a REFERENCE (data still stored in SQLite database)
In order to open a database in R, you need to take the following steps:
Connect to the database using the dbConnect function.
library(DBI) #when using SQLite Database
canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
library(RPostgres) #when using PostgreSQL Database
canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db", host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123")
For PostgreSQL, there’s additional info needed to include:
host: URL pointing to where the database is located
port: communication endpoint btwn R and the PostgreSQL database
Check what tables (similar to R data frames, Excel spreadsheets) are in the database using the dbListTables function
Once you've picked a table, create an R object for it using the tbl function
Make sure you filter & select the database table before using collect (see the sketch after this list)
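A minimal sketch of that workflow, continuing the SQLite connection above; the table name "lang" and the column names used here are assumptions for illustration:
library(DBI)
library(dbplyr)
library(tidyverse)
canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
dbListTables(canlang_conn)                                             # list the tables in the database
lang_db <- tbl(canlang_conn, "lang")                                   # a REFERENCE to the table (data stays in the database)
aboriginal_db <- filter(lang_db, category == "Aboriginal languages")   # hypothetical column/value
aboriginal_db <- select(aboriginal_db, language, mother_tongue)        # narrow down before collecting
aboriginal_data <- collect(aboriginal_db)                              # NOW it becomes a real data frame in R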
Use write_csv() to save processed data frames back to CSV format after analysis.
Strings in R: sequences of characters used to represent text data; they can be created using single or double quotes.
Function | Description |
read_csv() | To read CSV files. |
read_csv2() | To read CSV files, but uses ; for field separators and , for decimal pts |
read_tsv() | To read tab-separated files. |
read_delim() | For flexibility in choosing different delimiters. |
read_excel() | For reading Excel spreadsheets. |
write_csv() | To save processed data frames back to CSV format. |
dbConnect() | Connect to databases. |
collect() | Bring data selected from a database into R as a data frame. |
filter() | To obtain a subset of ROWS w/ specific values; if there are multiple CONDITIONS, you can JOIN THEM USING COMMA or & |
select() | Extract specific COLUMNS from the data frame. |
mutate() | Adds or modifies columns in a data frame. |
ggplot() | Function used for creating visualizations in R. |
ggtitle() | Adds specified title to the plot |
library() | To load a package into R, to use its functions and datasets |
# | For comments, R will ignore any text after the symbol |
rename() | To rename columns, use subsequent arguments |
download.file() | Use when data at a URL is not formatted nicely, so a read_* function can be used on the downloaded copy; 1st argument: url, 2nd: path where we want to store the downloaded file |
dbListTables() | To list all names of tables in a database |
tbl() | To reference a table in the database (allows us to perform operations like selecting columns etc.) |
| To count how many times a specific value appears in a column |
sheet, range | read_excel() arguments to specify the sheet # or name, and to specify cell ranges |
path | Used to specify the location of a file on the computer or on the web when reading data into R; the format varies depending on whether it's a local or remote file; when it's a URL, assign it to an object first |
| Replaces spaces w/ underscores & converts to lowercase to make names a standard format |
arrange() | Order rows by the values of the given columns (default is increasing) |
theme() | Used to modify the non-data components of the plot with specified options |
slice() | SELECTS ROWS according to their row number, keeps rows in the given range, e.g. rows 1 to 10 |
slice_max() | Keep the n rows with the LARGEST values of a variable |
slice_min() | Keep the n rows with the SMALLEST values of a variable |
head() | To preview the first 6 rows of a data frame |
tail() | To preview the last 6 rows of a data frame |
n() | Counts the number of rows in a table's group, e.g. a column has names of restaurants and we want to find the count for each restaurant name |
group_by() | Organizes rows into groups based on the values of a second variable |
? | To pull up documentation for most functions |
Real-World Data Issues:
Data is often messy:
Inconsistent formats (e.g., commas, tabs, semicolons).
Missing data and extra empty lines.
Split into multiple files (e.g., yearly data).
Corrupted files and custom formats.
Even after loading data in R, it may remain messy.
Key Point: Make data "tidy"
Tidy data enhances: Reproducibility, Efficiency, Collaboration
Tidy datasets give each variable its own column and each observation its own row; messy datasets often cram multiple variables or values into a single column.
Vectors: objects that contain ordered collections of elements of the SAME DATA TYPE
Create using the c() function (see the sketch after this list)
Lists: Ordered collections that can contain MIXED TYPES of elements.
Data Frame as a Special List:
Variables should be in columns
Each row should represent a single observation
Each cell is single measurement
Easier manipulation, plotting, and analysis in consistent formats.
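A small sketch contrasting these structures; the names and values are invented for illustration:
library(tidyverse)
ages <- c(20, 35, 28)                      # vector: ordered, all elements the SAME type (numeric)
mixed <- list("Ana", 20, TRUE)             # list: ordered, can MIX types
people <- tibble(name = c("Ana", "Ben"),   # data frame: a special list of equal-length column vectors;
                 age = c(20, 35))          # each row is one observation, each cell one measurement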
Different representations of tuberculosis data are analyzed for tidiness.
Analysis Results:
Example structures of data:
Tidy Data:
Country
, Year
, Cases
, Population
Complies with tidy data principles (each variable in its own column).
Not Tidy Data:
Columns with non-specific names or multiple values in one column (e.g., year grouped with cases and population).
Requires restructuring (e.g. with pivot_longer) to be considered tidy (see the sketch below)
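A hedged sketch of restructuring a wide table into tidy form with pivot_longer(); the tuberculosis-style columns (`1999`, `2000`) and values are invented for illustration:
library(tidyverse)
untidy <- tibble(country = c("Canada", "Brazil"),
                 `1999` = c(100, 200),     # the year is hidden in the column NAMES (not tidy)
                 `2000` = c(110, 215))
tidy <- pivot_longer(untidy,
                     cols = `1999`:`2000`, # columns to collapse
                     names_to = "year",    # old column names become a new variable
                     values_to = "cases")  # old cell values become a new variable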
Focus on tidyverse package functions:
dplyr functions: select, filter, mutate, group_by, summarize
tidyr functions: pivot_longer, pivot_wider
purrr function: map_dfr
Examples using library functions demonstrated with the penguins dataset.
Key functions:
Select: Choose columns from a data frame.
Filter: Subset rows based on conditions (e.g., flipper length).
Mutate: Transform existing columns or create new columns (e.g., mass conversion).
Discussed methods to handle multiple operations:
Saving Intermediate Objects:
Drawbacks include complexity and potential memory issues.
Composing Functions:
Disadvantage: Less readable due to inner function execution first.
Using Pipes (|>):
Enhances readability and avoids intermediate variables.
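A sketch of the three styles using the penguins data frame from the palmerpenguins package (the columns species and body_mass_g come from that dataset; the 4000 g threshold is arbitrary):
library(tidyverse)
library(palmerpenguins)
# 1. intermediate objects
heavy <- filter(penguins, body_mass_g > 4000)
heavy_small <- select(heavy, species, body_mass_g)
# 2. composed functions (read inside-out, less readable)
heavy_small <- select(filter(penguins, body_mass_g > 4000), species, body_mass_g)
# 3. pipe (reads left to right, no intermediate names)
heavy_small <- penguins |>
  filter(body_mass_g > 4000) |>
  select(species, body_mass_g)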
group_by() + summarize() Concepts:
Grouping data based on COLUMN VALUES.
Summarizing is when you COMBINE DATA into fewer summary values (e.g., average pollution).
Applications to the penguins dataset: average mass by species.
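A minimal sketch of the "average mass by species" computation (penguins and its columns come from palmerpenguins; na.rm handles the missing masses):
library(tidyverse)
library(palmerpenguins)
penguins |>
  group_by(species) |>                                    # one group per species
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))  # collapse each group to a single summary row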
Iteration Defined:
Executing a process multiple times (e.g., in data manipulation).
map_dfr Functionality:
Example of iterating over multiple columns in the penguins dataset (see the sketch after this list).
Caution on the multiplicity of map_ variants available.
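A hedged sketch of map_dfr() applied over several numeric columns of penguins; select() comes first because map_dfr has no argument to choose columns:
library(tidyverse)
library(palmerpenguins)
penguins |>
  select(bill_length_mm, flipper_length_mm, body_mass_g) |>  # pick the columns to iterate over
  map_dfr(mean, na.rm = TRUE)                                # apply mean to each column, bind results row-wise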
na.rm in R: argument in functions to handle missing values (NA).
Purpose: Specify whether to remove NA values before computation.
Usage: Set to TRUE to ignore NAs, allowing calculations on available data.
Example: mean(x, na.rm = TRUE) calculates the mean while excluding NAs.
Importance: Useful for accurate analysis in datasets with missing values.
Operators for Data Wrangling:
Comparison:
==: Extracting rows w/ a certain value; checks for equality btwn two values, returning TRUE if equal and FALSE otherwise
!=: Extracting rows that DON’T have a certain value; means NOT EQUAL TO
<: Look for values BELOW a threshold
<=: Look for values EQUAL TO OR BELOW a threshold
>: Look for values ABOVE a threshold
>=: Look for values EQUAL TO OR ABOVE a threshold
Logical:
%in%: Extracting rows w/ values in a vector; used to see if a value belongs to a vector
!: Means NOT, changing TRUE to FALSE and FALSE to TRUE
&: Extracting rows satisfying multiple conditions, works just like the comma
|: Vertical pipe to give cases where one condition or another or both are satisfied
e.g. filter(official_langs, region == "Calgary" | region == "Edmonton")
%>%: sends the result of one function to the next function
Function | Description |
pivot_longer() | Transform data FROM WIDE TO LONG FORMAT |
pivot_wider() | Convert data FROM LONG TO WIDE FORMAT |
separate() | Split a single column into multiple columns by delimiter. |
summarize() | Generate summary statistics on columns |
map_dfr() | Apply a function to each element in a list/vector & return the results as a single data frame ("dfr" is for data frames, combining row-wise); DOESN'T have an argument to specify which columns to apply the function to, so use select beforehand |
group_by() | To group by one or more variables |
across() | Apply functions to MULTIPLE COLUMNS simultaneously |
rowwise() | Applies functions across COLUMNS WITHIN ONE ROW |
c() | To CREATE VECTORS/COLUMNS (concatenate) |
semi_join() | Filtering data frames based on matches in another data frame |
as.numeric() | To convert data to numeric format |
| To have plots side by side for comparison |
Questions driving visualization: Each visualization should aim to answer a specific question from the dataset.
Clarity and Greatness: A good visualization answers a question clearly, while a great one hints at the question itself.
Types of questions answered through visualizations:
Descriptive: E.g., What are the largest 7 landmasses on Earth?
Exploratory: E.g., Is there a relationship between penguin body mass and bill length?
Other types: inferential, predictive, causal, mechanistic.
ggplot2: Key components include:
Aesthetic Mappings (aes): map data frame columns to visual properties
Geometric Objects (geom): encode HOW TO DISPLAY those visual properties
Scales: transform variables, set limits
Elements are combined using the + operator (see the sketch below).
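A minimal ggplot sketch combining these pieces (the penguins columns again come from palmerpenguins; the labels are arbitrary):
library(tidyverse)
library(palmerpenguins)
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, color = species)) +  # aesthetic mappings
  geom_point(alpha = 0.5) +                                                    # geometric object
  labs(x = "Body mass (g)", y = "Bill length (mm)", color = "Species")         # readable labels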
Categorical Variables: can be divided into groups (categories) e.g. marital status
Quantitative Variables: measured on numeric scale (e.g. height)
Scatter plot: To visualize the relationship between 2 QUANTITATIVE variables
Example Question: How does horsepower affect the fuel economy of an engine?
Line plot: To visualize TRENDS w/ respect to an INDEPENDENT QUANTITY (e.g. time)
Example: Changes in atmospheric CO2 levels over the last 40 years.
Bar plot: To visualize COMPARISON of AMOUNTS
Example: Number of students in each faculty (sci, bus, med, arts)
Histogram: To visualize the DISTRIBUTION OF A SINGLE QUANTITATIVE VARIABLE
Example: Differences in life expectancy across continents in 2016.
Avoid using tables, pie charts, and 3D visualizations for clarity and simplicity
Use color palettes that are simple and colorblind-friendly
A great visualization conveys its story without requiring additional explanation
Convey message, minimize noise
use legends and labels
ensure text is big enough to read and clearly visible
be wary of overplotting
Presented visualizations should form a narrative:
Establish the setting and scope.
Pose the question.
Use the visualization to answer the question.
Summarize findings and motivate further discussion.
Choose appropriate output formats based on needs:
Raster Formats: 2D grid of pixels, often compressed before storing so it takes less space
JPEG (.jpg, .jpeg): lossy, usually for photos
PNG (.png): lossless, usually for PLOTS / LINE drawings
BMP (.bmp): LOSSLESS, RAW IMAGE data, no compression (rarely used)
TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing
Vector Formats: collection of mathematical objects (lines, surfaces, shapes, curves)
SVG (.svg): general purpose use
EPS (.eps): general purpose use, infinitely scalable, great for presentations
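A short hedged sketch of saving the same plot in a raster and a vector format with ggsave(); the plot object and file names are placeholders:
library(tidyverse)
my_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()  # any ggplot object
ggsave("my_plot.png", my_plot, width = 6, height = 4)           # raster (lossless PNG)
ggsave("my_plot.svg", my_plot, width = 6, height = 4)           # vector (needs the svglite package installed)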
Function | Description |
geom_point() | To make a scatterplot |
geom_line() | To make a line plot |
geom_histogram() | To make a histogram |
geom_bar() | To make a bar graph |
geom_vline() | Adds a VERTICAL line to a plot, to INDICATE a SPECIFIC X-AXIS value e.g. threshold, average |
geom_hline() | Adds a HORIZONTAL line to a plot, to INDICATE a SPECIFIC Y-AXIS value e.g. threshold, average |
aes() | To create an aesthetic mapping |
geom_*() | To create a geometric object |
xlim() / ylim() | Sets limits for the x-axis/y-axis in a plot, controls the range of values displayed |
fill | For aes mapping, fills in BARS by a specific color or separates counts by a variable aside from the x/y axis |
position = "dodge" | Prevents the chart from being stacked, PRESERVES the VERTICAL position of bars while adjusting the horizontal position |
color | To distinguish different groups/categories by color |
shape | Allows further visualization by varying data points by shape |
labs() | For labels, axes |
facet_grid() | To create a plot that has MULTIPLE SUBPLOTS arranged in a grid |
ggsave() | To save plots in various formats |
\n | To create a line break in axis names |
scale_fill_brewer() | For fill aes w/ a CATEGORICAL variable |
scale_fill_distiller() | For fill aes w/ a NUMERIC variable |
scale_x_log10() / scale_y_log10() | To accomplish logarithmic scaling when AXES have LARGE numbers, also FORMAT AXIS LABELS to increase readability |
| Name of the column we want to use for comparing |
fct_reorder() | In aes mapping, to reorder factors based on values of another variable |
as_factor() | Convert variables to factor type (Factor: data type used to represent categories) |
add alpha = 0.2 to geom_point to enhance visibility when points overlap
theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical")
options(repr.plot.width = 8, repr.plot.height = 8) to set plot size
geom_bar(stat = "identity") to DISPLAY VALUES in the data frame AS IS
geom_bar(position = "identity") to DISPLAY BARS exactly as they are w/o any stacking or dodging
geom_vline(xintercept = 792.458, linetype = "dashed") to make certain lines stand out
geom_histogram(bins = ..., binwidth = ...) to control the number or width of histogram bins
Version Control: the process of keeping a record of changes to documents, indicating changes made, and who made them
View earlier versions and revert changes
Facilitates resolving conflicts in edits.
Originally for software development, but now used for many tasks
Initial approach: Sending files to teammates via email.
Problems:
Unclear versioning and edit tracking.
No insight into who made edits or when.
Difficulty in reverting to prior versions.
Poor communication regarding tasks and issues.
Alternative approach: Sharing via Dropbox/Google Drive.
Limitations:
Simple version management but lacks:
Edit tracking (who, when).
Clarity on project status after extended periods.
Easy means to revert changes.
Organized discussions for tasks/issues.
Complicated file naming like "final_revision_v3_Oct2020.docx" instead of clear version history.
Git → Version Control Tool
Tracks files in a repository (folder that you tell Git to pay attention to)
Responsible for keeping track of changes, sharing files, and conflict resolution
Runs on local machines of all users
GitHub → Cloud Service, Online Repositories
Cloud service that hosts repositories
Manages permissions (view/edit access)
Provides tools for project-specific communication (issues tracking)
Can build and host websites/blogs
Two copies of the repository are typically created:
Local Repository: Personal working copy on OWN COMPUTER
Remote Repository: Copy of project hosted on server ONLINE, for sharing w/ others
Both copies maintain a full project history with commits (snapshots of project files at specific times).
Commits include a message and a unique hash as identifiers.
Overview of important files and areas:
Working Directory: Files like analysis.ipynb, notes.txt, and README.md on the local computer.
Staging Area: Where changes are prepared for a commit (.git).
Remote Repository: GitHub stores your files.
Committing changes: Save your changes in the local repo
Working Directory: Contains files that may have pending changes
Git Add: command to stage selected files to prepare for committing
Typical files to stage: analysis.ipynb, README.md.
Git Commit: takes snapshot of files in the staging area, recording changes
Includes message to describe the nature of the changes
Pushing changes: Share commits with remote repo
Git Push: UPLOAD UPDATES to the REMOTE repo on GitHub
Keeps COLLABORATORS SYNCHRONIZED w/ latest changes
When want to back up work
Pulling changes: COLLECTING NEW CHANGES others added on remote repo to your local machine
Git Pull: download file to computer
To SYNCHRONIZE LOCAL repo w/ changes made by collaborators on the remote repo
Cloning a repo: copy/download ENTIRE CONTENT (files, project, history, location of repo) of a REMOTE server repo to LOCAL MACHINE
Git Clone: creates connection between local and remote repo
To GET a WORKING COPY of repo
Git has distinct step of adding files to staging area:
Not all changes made are ones we want to push to remote GitHub repo
Allows us to edit multiple files at once
Associate particular commit message w/ particular files (so they specifically reflect changes made)
GitHub Issues: ideal for project-specific communication (better than email)
Issues persist even after they are closed; closed issues are archived, not deleted
CAN CLONE/PULL from any PUBLIC remote repo on GitHub, BUT can only push to public remote repo YOU OWN or have ACCESS to
Editing files on GitHub w/ pen tool: reserved for small edits to plain text files
Generate GitHub personal access token to authenticate when sending/retrieving work
Good practice to pull changes at start of every work session before working on local copy
Handling merge conflicts: open the offending file in a plain text editor & identify the conflict markers (<<<<<<<, =======, >>>>>>>)
Manually edit file to resolve the conflicts, ensuring to keep the necessary changes from both versions before saving & committing resolved file
Example: Diagnosing cancer tumor cells
Labels: "benign" or "malignant"
Questions arise about new cases based on features (e.g., Concavity and Perimeter)
Classification methods aim to answer questions regarding labels based on data
Predict label/class for a new observation using K closest points from dataset
Compute distance between new observation and each observation in training set, to find K nearest neighbours
can go beyond 2 predictors
Sort data in ascending order according to distances
Choose the top K rows as “neighbors”
Classify the new observation based on a majority vote of the neighbours (see the sketch after this list)
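A rough tidyverse sketch of these steps for K = 3 with two predictors; the data frame name cancer, its columns, and the new observation's values are assumptions for illustration (the tidymodels workflow below is the approach the notes actually use):
library(tidyverse)
# hypothetical new tumor with two standardized measurements
new_concavity <- 3.5
new_perimeter <- 0.5
neighbors <- cancer |>                                   # `cancer` assumed to have Concavity, Perimeter, Class
  mutate(dist = sqrt((Concavity - new_concavity)^2 +
                     (Perimeter - new_perimeter)^2)) |>  # distance from each observation to the new one
  arrange(dist) |>                                       # sort in ascending order of distance
  slice(1:3)                                             # keep the K = 3 nearest neighbours
count(neighbors, Class)                                  # majority vote gives the predicted label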
tidymodels: package in R; a collection of packages that handles computing distances, standardization, balancing, and prediction
Load the libraries and data
Necessary libraries include tidyverse
and tidymodels
Import data and mutate as necessary
Make a recipe to specify predictors/response and preprocess data
recipe(): main arguments are a formula (response ~ predictors) and the data
prep() & bake()
Build a model specification (model_spec) to specify the model and training algorithm
model type: kind of model you want to fit
arguments: model parameter values
engine: underlying package the model should come from
mode: type of prediction
Put them together in a workflow and then fit it
Predict a new class label using predict() (see the sketch below)
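A hedged end-to-end sketch of that workflow, assuming a cancer data frame with columns Class, Concavity, and Perimeter (names chosen to match the running example, not taken verbatim from the notes):
library(tidyverse)
library(tidymodels)
# recipe: specify response ~ predictors and standardize the predictors
knn_recipe <- recipe(Class ~ Concavity + Perimeter, data = cancer) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())
# model specification: model type, parameter values, engine, and mode
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")
# combine recipe and spec in a workflow, then fit
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer)
# predict the class for a hypothetical new observation
new_obs <- tibble(Concavity = 3.5, Perimeter = 0.5)
predict(knn_fit, new_obs)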
Standardization: important to account for all features in distance calculation, enhances prediction accuracy
Adjusting for center and spread
Shift and scale data to have a mean (average) of 0 and a standard deviation of 1
Non standardized Data:
Issues arise if one variable's scale is significantly larger than others
Function | Description |
recipe() | To prepare data for a modelling task |
step_center() | Centers numeric variables, to help standardize data |
step_scale() | Scales numeric variables, to help standardize data |
prep() | Prepares the recipe for modelling by estimating required values from the training data |
bake() | Applies the preprocessing steps to new data |
set.seed(): call EXACTLY ONCE AT THE BEGINNING OF EVERY ANALYSIS
Randomness: ensures the analysis is fair, unbiased, and not influenced by human choices; setting a seed keeps the random steps reproducible
Use Random Seeds: Ensures consistent results across iterations of analysis in R
Call set.seed() to initialize the random number generator for consistent outputs.
Key Concept: A classifier doesn't need 100% accuracy but must minimize wrong predictions
Importance of evaluating classifiers on new data (test set) not seen during training
Split data: Use initial_split() to separate into training (75%) and testing (25%).
Golden rule: Do not use test data for model building
Example: Breast Cancer Data
Predict benign vs malignant tumors from biopsy images using a classifier.
Measure accuracy:
Accuracy: correctness overall
Confusion Matrix: shows how many test set labels of each type are predicted correctly and incorrectly
Example confusion matrix:
True Positive (TP): 1
False Positive (FP): 3
True Negative (TN): 57
False Negative (FN): 4
Performance metrics based on the confusion matrix above:
Accuracy: 89%
Precision: 25%
Recall: 20%
Positive label: the label we are more interested in identifying
Precision: consistency, quantifies how many of positive predictions the classifier made were actually positive
Recall: found all the relevant items, quantifies how many of the positive observations in the test set were identified as positive
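A quick check of those numbers from the counts above (TP = 1, FP = 3, TN = 57, FN = 4):
tp <- 1; fp <- 3; tn <- 57; fn <- 4
accuracy  <- (tp + tn) / (tp + fp + tn + fn)  # 58 / 65 ≈ 0.89
precision <- tp / (tp + fp)                   # 1 / 4  = 0.25
recall    <- tp / (tp + fn)                   # 1 / 5  = 0.20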
Workflow using tidymodels
Load necessary packages and set the random seed.
Split data: Use initial_split() to separate into training (75%) and testing (25%)
Preprocess data using a recipe for standardization (important for K-NN accuracy) USING ONLY TRAINING DATA
Train Classifier: Specify model parameters, fit the K-NN model on training data.
Predict on Test Set: Assess model performance by predicting labels for the test set
Performance Metrics: Leverage metrics() to evaluate accuracy, precision, and recall (see the sketch after this list)
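A hedged sketch of the split-train-evaluate loop, reusing the assumed cancer data frame and the recipe/spec pattern from the earlier sketch:
library(tidymodels)
set.seed(1234)                                              # once, at the start of the analysis
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)
# recipe and spec as before, but built from the TRAINING data only
knn_recipe <- recipe(Class ~ Concavity + Perimeter, data = cancer_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer_train)
# predictions on the held-out test set, then accuracy and the confusion matrix
cancer_preds <- predict(knn_fit, cancer_test) |> bind_cols(cancer_test)
cancer_preds |> metrics(truth = Class, estimate = .pred_class)
cancer_preds |> conf_mat(truth = Class, estimate = .pred_class)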
Critical Evaluation of Performance
Context of application defines acceptable accuracy and trade-offs between precision and recall.
Always compare classifier performance against a simple majority classifier
Importance of K parameter in K-NN and its impact on model performance.
Cross-Validation: Repeat process to evaluate accuracy using different splits of data.
Parameter Grid: Specify values to optimize during tuning and establish best-performing parameters (see the sketch after this list).
Underfitting: High K leads to poor model complexity; oversimplifies and loses predictive capacity.
Overfitting: Low K considers too much noise; less reliable predictions on new data.
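A sketch of tuning K with cross-validation under the same assumptions (cancer_train and knn_recipe as above); the grid of candidate K values is arbitrary:
library(tidymodels)
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)      # 5 cross-validation splits of the training data
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>           # K left free for tuning
  set_engine("kknn") |>
  set_mode("classification")
k_grid <- tibble(neighbors = seq(1, 15, by = 2))                   # parameter grid of candidate K values
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_grid) |>
  collect_metrics()                                                # accuracy per K, averaged over the folds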
Effects of Irrelevant Predictors
Inclusion of irrelevant predictors harms model accuracy by introducing noise
Utilize systematic methods for variable selection by removing non-informative variables
Forward Selection Method
Iterative approach to build models with increasing sets of predictors.
Allows focusing on maintaining accuracy while minimizing predictor variables.
Function | Description |
set.seed() | Ensures consistent results from random processes in analyses |
sample() | To randomly pick a number |
initial_split() | To split a dataset into training and testing sets |
metrics() | Calculates performance metrics for a model's predictions, to assess how well the model is performing on test data |
glimpse() | To view data with a large number of columns |
bind_cols() | To add a column of predictions to the original data set |
tune_grid() | To fit the model for each value in a range of parameter values |