DSCI NOTES Weeks 1-7

Week 1: Introduction to Data Analysis

Types of Data Analysis Questions

  • Descriptive: Summarizes characteristics of a data set without interpretation.

    • Example: "How many people live in each province and territory in Canada?"

  • Exploratory: Looks for patterns, trends, or relationships in a single data set.

    • Example: "Does political party voting patterns change with indicators of wealth in a set of data collected on 2,000 people living in Canada?"

  • Predictive: Focuses on predicting outcomes based on existing data.

    • Example: "What political party will someone vote for in the next Canadian election?"

  • Inferential: Examines if findings from a single data set can apply to the wider population.

    • Example: "Does political party voting change with indicators of wealth for all people living in Canada?"

Analysis Tools

  1. Summarization: Computes and reports aggregated values.

    • Example Question: "What is the average race time for runners in this data set?"

  2. Visualization: Graphically plots data for interpretation.

    • Example Question: "Is there any relationship between race time and age for runners in this data set?"

  3. Classification: Predicts a class/category for new observations.

    • Example Question: "Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?"

  4. Regression: Predicts a quantitative value for new observations.

    • Example Question: "What will be the race time for a 20-year-old runner who weighs 50kg?"

  5. Clustering: Finds unknown/unlabelled subgroups in a dataset.

    • Example Question: "What products are commonly bought together on Amazon?"

  6. Estimation: Uses a measurement or average from a smaller sample to estimate a value for a larger population.

    • Example Question: "Given a survey of cellphone ownership of 100 Canadians, what proportion of the entire Canadian population own Android phones?"

Data Science: use of reproducible + auditable processes to get value from data

Reproducible: easily repeated by others

Auditable: easily traced & critiqued

Data Set Structure

  • Data Set: Structured collection of numbers and characters.

    • Rows: Represent observations (horizontal).

    • Columns: Represent variables (vertical).

  • Data Frame: a special kind of data set in which each column is a vector of a single type and each row is an observation.

R Messages and Functions

  • Loading the tidyverse in R can produce messages about attached packages and potential conflicts.

  • Users can access specific versions of functions from loaded packages by using the package prefix (e.g., dplyr::filter()).

Naming Conventions in R

  • Use lowercase letters, numbers, and underscore (_).

  • Avoid using spaces or special characters.

  • R is case-sensitive.

  • Use meaningful variable names for better readability of scripts.

Filtering and Selecting Data: Filter THEN Select

  • The filter() function: obtain subset of ROWS with specific values

    • Example: filter(data_frame, logical_statement) returns a new data frame containing only the rows for which the logical statement evaluates to TRUE.

  • The select() function extracts specific COLUMNS from the data frame.

Mutating Data

  • The mutate() function adds/modifies columns in a data frame, allowing for new calculations or transformations of existing data.
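A minimal sketch of these three verbs together, using a small hypothetical data frame of runners (race_data, age, and time_hours are made-up names, not from the course data):

    library(tidyverse)

    # hypothetical data frame of runners
    race_data <- tibble(age = c(20, 34, 51), time_hours = c(1.8, 2.1, 2.5))

    young_runners <- filter(race_data, age < 40)                       # subset of ROWS
    times_only <- select(young_runners, time_hours)                    # subset of COLUMNS
    with_minutes <- mutate(young_runners, time_min = time_hours * 60)  # add a new column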

Data Visualization with ggplot

  • ggplot() is the function used for creating visualizations in R; plots are built up by adding layers (data, aesthetic mappings, geometric objects), as shown in the sketch below.
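A minimal sketch reusing the hypothetical race_data frame above, showing how layers are combined with +:

    ggplot(race_data, aes(x = age, y = time_hours)) +  # map columns to visual properties
      geom_point() +                                   # display them as points
      ggtitle("Race time vs. age")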


Week 2: Reading Data

Advantages of using R:

  • statistical analysis functions go beyond Excel

  • free & open-source

  • transparent & reproducible code

  • can handle large amounts of data + complex analyses

  • Correlation does not equal causation: Even if a correlation existed, it wouldn't necessarily mean that one variable causes the other. There could be other factors influencing both variables.

  • More data and analysis: To make a more definitive conclusion, it would be helpful to have more data points and conduct statistical tests to quantify the strength and significance of any potential correlation

Types of File Paths

Path: where the file lives on computer, the directions to the file

Local (computer):

  • Absolute path: locates file with respect to the “root” folder on computer

    • full path from the root folder to the file's location on the filesystem

    • STARTS with /

    • e.g. /home/instructor/documents/timesheet.xlsx

  • Relative path: locates a file relative to your working directory

    • DOESN’T start with /

    • e.g. documents/timesheet.xlsx

    • Working directory: current folder/location containing the file we are currently working on, serving as reference point for relative paths

    • The relative path itself DOESN’T include the working directory

    • Using relative paths makes it easier to navigate files and ensures code can be run on a different computer

Remote (on web):

  • Uniform Resource Locator (URL): A specific type of address used to access resources on the internet, which includes the protocol, domain name, and path to the resource.

    • http:// or https://

    • assign it to object named url to use it:

    • url <- "https://raw.githubusercontent.com/UBC_DCSI/data/main/can_lang.csv"
      canlang_data <- read_csv(url)
      canlang_data

Ex. if absolute path looks like: /Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx

and working directory is UBC, relative path: BIOL363/SciaticNerveLab/sn_trial_1.xlsx

Loading data from computer (Workflow): DO IT CAREFULLY

  1. Navigate file, see what it looks like: column names? delimiters? lines to skip?

  2. Prepare to load into R

    • skim data, might need to modify, look at how values are separated

    • might need to load a library, download file, or connect to database

  3. Load into R: check with reading data function

  4. Inspect the result CAREFULLY to make sure it worked: to reduce bugs + speed up

Reading Data Functions

  • read_csv(): To read CSV files (commas as the delimiter)

  • read_tsv(): To read tab-separated files (tabs between the columns)

  • read_delim(): For flexibility in choosing different delimiters, can import both CSV and TSV files, but just SPECIFY WHICH DELIMITER during import

    • 1. specify path to file as first argument

    • 2. provide tab character “\t" as delim argument

    • 3. set the col_names argument to FALSE to say there are NO COLUMN NAMES GIVEN in the data; if TRUE (the default), the first row will be treated as column names

    • e.g. canlang_data <- read_delim("data/can_lang_no_names.tsv", delim = "\t", col_names = FALSE)

  • read_excel(): For reading Excel spreadsheets, load readxl package first

    • sheet argument to specify sheet # or name (when file has multiple sheets)

    • range argument to specify cell ranges

Skipping Rows

  • skip: to ignore non-crucial information at the top of the data files when loading, so data can be read properly

  • 1. Look at data first to see HOW MANY LINES WE DON’T NEED (e.g. if columns start at line 4, skip the first 3 lines)

    • canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)

Working with Databases (SQLite, PostgreSQL)

  • Connect to databases using dbConnect() and run SQL queries through R without needing extensive SQL knowledge, thanks to packages like dbplyr.

  • Use collect() to bring data from a database into R as a data frame for local analysis.

Note: tbl() DOESN’T give us a data frame from the database, only a REFERENCE to the table (the data is still stored in the SQLite database)

Reading data from database

In order to open a database in R, you need to take the following steps:

  1. Connect to the database using the dbConnect function.

    library(DBI) #when using SQLite Database
    canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
    library(RPostgres) #when using PostgreSQL Database
    canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db", host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123")

    For PostgreSQL, there’s additional info needed to include:

    host: URL pointing to where database is located

    port: communication endpoint btwn R and PostgreSQL database

  2. Check what tables (similar to R dataframes, Excel spreadsheets) are in the database using the dbListTables function

  3. Once you've picked a table, create an R object for it using the tbl function

  4. Make sure you filter & select database table before using collect
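A sketch of these four steps for the SQLite case, using the data/can_lang.db file from step 1 (the table and column names here are assumptions for illustration):

    library(DBI)
    library(tidyverse)

    canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")  # 1. connect

    dbListTables(canlang_conn)                                        # 2. list tables

    lang_db <- tbl(canlang_conn, "lang")                              # 3. reference a table (name assumed)

    aboriginal_lang <- lang_db |>                                     # 4. filter & select, THEN collect
      filter(category == "Aboriginal languages") |>
      select(language, mother_tongue) |>
      collect()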

Writing Data to Files

  • Use write_csv() to save processed data frames back to CSV format after analysis.

Strings in R: sequences of characters used to represent text data; created using single or double quotes.

Function Reference

  • read_csv(): To read CSV files.

  • read_csv2(): To read CSV files that use ";" for field separators and "," for decimal points.

  • read_tsv(): To read tab-separated files.

  • read_delim(): For flexibility in choosing different delimiters.

  • read_excel(): For reading Excel spreadsheets.

  • write_csv(): To save processed data frames back to CSV format.

    • write_csv(dataframe, "path/to/your/file.csv")

  • dbConnect(): Connect to databases.

  • collect(): Bring data selected from a database into R as a data frame.

    • data_collected <- collect(data_selected)

  • filter(): To obtain a subset of ROWS w/ specific values.

    • filter(dataframe_name, column_name == "specific_value")

    • if there are multiple CONDITIONS, you can JOIN THEM USING A COMMA or &

    • e.g. filter(dataframe, column_name == "specific_value" & another_column > 100)

  • select(): Extract specific COLUMNS from the data frame.

    • select(dataframe_name, column_we_want_to_extract)

    • can also do select(dataframe, starts_with("bill_")) to select all columns whose names start the same way

  • mutate(): Adds or modifies columns in a data frame.

    • mutate(dataframe_name, new_column_name = expression), expression example: old_column * 10

  • ggplot(): Function used for creating visualizations in R.

  • ggtitle("title"): Adds the specified title to the plot.

  • library(): To load a package into R, making its functions and datasets available.

  • #: For comments; R will ignore any text after the symbol.

  • rename(): To rename columns; use subsequent arguments of the form new_name = old_name.

  • download.file(): Use when data at a URL is not formatted nicely; download it first, then use a read_* function.

    • 1st argument: url, 2nd: path where we want to store the downloaded file

  • dbListTables(): To list the names of all tables in a database.

    • tables <- dbListTables(connection)

  • tbl(): To reference a table in the database (allows us to perform operations like selecting columns etc.).

    • e.g. flight_data <- tbl(connection, "table_name")

  • count(): To count how many times each value appears in a column.

    • count(data_frame_name, column_name)

  • sheet, range: Arguments to read_excel() to specify the sheet number/name and the cell range.

    • canlang_data <- read_excel("data/can_lang.xlsx", sheet = "Sheet1", range = "A1:C10")

  • path: Used to specify the location of a file on the computer or on the web when reading data into R; the format varies depending on whether it is a local or remote file.

    • when it is a URL, assign it first: url <- "https://raw.githubusercontent.com/UBC_DCSI/data/main/can_lang.csv"

  • clean_names(): Replaces spaces with underscores and converts names to lowercase to give a standard format.

    • happy_df <- happy_df |> clean_names()

  • arrange(): Order rows by the values of the given columns (default is increasing).

    • arrange(data_frame, desc(column_to_sort_by)) to sort in decreasing order

  • theme(): Used to modify the non-data components of the plot with specified options.

    • theme(text = element_text(size = 20))

  • slice(): SELECTS ROWS according to their row number, keeping rows in the given range.

    • e.g. we want rows 1 to 10: slice(data_frame, 1:10)

  • slice_max(): Keep the n rows with the LARGEST values of a variable.

    • slice_max(data_frame, ordering_column, n = number_of_rows_to_keep)

  • slice_min(): Keep the n rows with the SMALLEST values of a variable.

    • slice_min(data_frame, ordering_column, n = number_of_rows_to_keep)

  • head(): To preview the first 6 rows of a data frame.

  • tail(): To preview the last 6 rows of a data frame.

  • n(): Counts the number of rows in a table's group.

    • e.g. a column has names of restaurants and we want the count for each restaurant name

  • reorder(): Reorders a variable based on the values of a second variable.

    • e.g. reorder(language, number_of_people_fluent)

  • ?: To pull up documentation for most functions.


Week 3: Wrangling

Real-World Data Issues:

  • Data is often messy:

    • Inconsistent formats (e.g., commas, tabs, semicolons).

    • Missing data and extra empty lines.

    • Split into multiple files (e.g., yearly data).

    • Corrupted files and custom formats.

  • Even after loading data in R, it may remain messy.

  • Key Point: Make data "tidy"

What is Tidy Data?

  • Tidy data enhances: Reproducibility, Efficiency, Collaboration

  • Tidy datasets give each variable its own column; messy datasets often combine multiple variables or multiple values into a single column.

Data Structures in R:

Vectors: An object that contains an ordered collection of elements of the SAME DATA TYPE

  • Create using the c() function

Lists: Ordered collections that can contain MIXED TYPES of elements.

Data Frame as a Special List: a data frame is a list of equal-length vectors (the columns), where each column can have a different type. See the sketch below.
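A small sketch of the three structures (the values here are made up):

    ages <- c(21, 35, 47)                                  # vector: same type throughout

    person <- list(name = "Sam", age = 21, grad = TRUE)    # list: mixed types allowed

    people <- tibble(name = c("Sam", "Ana"),               # data frame: a list of equal-length
                     age = c(21, 35))                      # vectors, one per column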

Characteristics of Tidy Data

  • Variables should be in columns

  • Each row should represent a single observation

  • Each cell is single measurement

  • Easier manipulation, plotting, and analysis in consistent formats.

Examples of Tuberculosis Data

  • Different representations of tuberculosis data are analyzed for tidiness.

  • Analysis Results:

    • Example structures of data:

    • Tidy Data:

      • Country, Year, Cases, Population

      • Complies with tidy data principles (each variable in its own column).

    • Not Tidy Data:

      • Columns with non-specific names or multiple values in one column (e.g., year grouped with cases and population).

      • Requires restructuring to be considered tidy

Tools for Tidying and Wrangling Data

  • Focus on tidyverse package functions:

    • dplyr functions:

      • select, filter, mutate, group_by, summarize

    • tidyr functions:

      • pivot_longer, pivot_wider

    • purrr function:

      • map_dfr

Step-by-Step Data Transformation

  • Examples using library functions demonstrated with penguins dataset.

  • Key functions:

    • Select: Choose columns from a data frame.

    • Filter: Subset rows based on conditions (e.g., flipper length).

    • Mutate: Transform existing columns or create new columns (e.g., mass conversion).

Chaining Operations with R

  • Discussed methods to handle multiple operations:

    • Saving Intermediate Objects:

      • Drawbacks include complexity and potential memory issues.

    • Composing Functions:

      • Disadvantage: Less readable due to inner function execution first.

    • Using Pipes (|>):

      • Enhances readability and avoids intermediate variables.
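A sketch contrasting the two styles, assuming the penguins data frame from the palmerpenguins package (column names assumed):

    # composing functions: read inside-out
    big <- select(filter(penguins, flipper_length_mm > 200), species, body_mass_g)

    # pipes: read top-to-bottom, no intermediate objects
    big <- penguins |>
      filter(flipper_length_mm > 200) |>
      select(species, body_mass_g)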

Grouping and Summarizing Data: group_by() + summarize()

  • Concepts:

    • Grouping data based on COLUMN VALUES.

    • Summarizing is when you COMBINE DATA into fewer summary values (e.g., average pollution).

  • Applications to penguins dataset: average mass by species.
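A sketch of the average-mass-by-species example (assuming the penguins columns species and body_mass_g):

    penguins |>
      group_by(species) |>                                     # group rows by COLUMN VALUES
      summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))   # one summary value per group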

Introduction to Iteration

  • Iteration Defined:

    • Executing a process multiple times (e.g., in data manipulation).

  • map_dfr Functionality:

    • Example of iterating over multiple columns in the penguins dataset.

    • Caution on the multiplicity of map_ variants available.
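A sketch of map_dfr applied to several numeric penguin columns (the column range is assumed); select() picks the columns first because map_dfr has no argument for choosing columns:

    penguins |>
      select(bill_length_mm:body_mass_g) |>
      map_dfr(max, na.rm = TRUE)   # one-row data frame: the max of each column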

Other

na.rm in R: Argument in functions to handle missing values (NA).

  • Purpose: Specify whether to remove NA values before computation.

  • Usage: Set to TRUE to ignore NAs, allowing calculations on available data.

  • Example: mean(x, na.rm = TRUE) calculates the mean while excluding NAs.

  • Importance: Useful for accurate analysis in datasets with missing values.

Operators for Data Wrangling:

  • Comparison:

    • ==: Extracting rows w/ certain value, checks for equality btwn two values, returning true if equal and false otherwise

    • !=: Extracting rows that DON’T have certain value, means NOT EQUAL TO

    • <: Look for values BELOW a threshold

    • <=: Look for values EQUAL TO OR BELOW a threshold

    • >: Look for values ABOVE a threshold

    • >=: Look for values EQUAL TO OR ABOVE a threshold

  • Logical:

    • %in%: Extracting rows w/ values in a vector; used to check whether a value belongs to a vector

    • !: Means NOT, changing TRUE to FALSE and FALSE to TRUE

    • &: Extracting rows satisfying multiple conditions, works just like the comma

    • |: Vertical bar meaning OR; gives cases where one condition or another (or both) are satisfied

      • e.g. filter(official_langs, region == "Calgary" | region == "Edmonton")

    • %>%: sends result of one function to the next function
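For example, %in% gives the same result as the | example above:

    filter(official_langs, region %in% c("Calgary", "Edmonton"))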

Functions for Data Wrangling

  • pivot_longer(): Transform data FROM WIDE TO LONG FORMAT.

    • pivot_longer(dataframe, cols = FROM:TO, names_to = "NEW column for the COLUMN NAMES we are combining", values_to = "NEW column for the VALUES")

  • pivot_wider(): Convert data FROM LONG TO WIDE FORMAT.

    • pivot_wider(dataframe, names_from = OLD_column_we_take_variable_names_from, values_from = OLD_column_we_take_values_from)

  • separate(): Split a single column into multiple columns by a delimiter.

    • separate(dataframe, col = old_column_to_split, into = c("new_column1", "new_column2"), sep = "delimiter", convert = TRUE)

  • summarize(): Generate summary statistics on columns.

    • summarize(data_frame, new_column = mean(old_column, na.rm = TRUE))

  • map_dfr(): Applies a function to each element in a list/vector (or each column of a data frame) and returns the results combined row-wise ("dfr" = data frame, rows) as a single data frame.

    • map_dfr(dataframe/vector/list, function_to_apply_to_each_column)

    • DOESN’T have an argument to specify which columns to apply the function to, so use select() beforehand

  • group_by(): To group rows by one or more variables.

    • group_by(data_frame, column, another_column)

  • across(): Apply functions to MULTIPLE COLUMNS simultaneously.

    • e.g. summarize(across(mother_tongue:lang_known, max))

  • rowwise(): Groups the data by row so that subsequent functions are applied across COLUMNS WITHIN ONE ROW.

  • c() (concatenate): To CREATE VECTORS/COLUMNS.

    • region <- c("Toronto", "Montreal", "Vancouver")

  • semi_join(): Filters a data frame based on matches in another data frame.

  • as.numeric(): To convert data to numeric format.

  • plot_grid() (from the cowplot package; library(cowplot)): To place plots side by side for comparison.

    • plot_grid(plot_1, plot_2)


Week 4: Visualization

Designing Visualizations

  • Questions driving visualization: Each visualization should aim to answer a specific question from the dataset.

  • Clarity and Greatness: A good visualization answers a question clearly, while a great one hints at the question itself.

  • Types of questions answered through visualizations:

    • Descriptive: E.g., What are the largest 7 landmasses on Earth?

    • Exploratory: E.g., Is there a relationship between penguin body mass and bill length?

    • Other types: inferential, predictive, causal, mechanistic.

Creating Visualizations in R

ggplot2: Key components include:

  1. Aesthetic Mappings (aes): map dataframe columns to visual properties

  2. Geometric Objects (geom): encode HOW TO DISPLAY those visual properties

  3. Scales: transform variables, set limits

  • Elements are combined using the + operator.
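A sketch combining the three components, assuming the can_lang data frame with columns most_at_home and mother_tongue (and the scales package for label_comma()):

    library(scales)

    ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +   # 1. aesthetic mappings
      geom_point(alpha = 0.2) +                                    # 2. geometric object
      scale_x_log10(labels = label_comma()) +                      # 3. scales
      scale_y_log10(labels = label_comma()) +
      labs(x = "Language spoken most at home", y = "Mother tongue")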

Types of Variables

  • Categorical Variables: can be divided into groups (categories) e.g. marital status

  • Quantitative Variables: measured on numeric scale (e.g. height)

Visualization Types

Scatter Plots

  • To visualize relationship between 2 QUANTITATIVE variables

  • Example Question: How does horsepower affect the fuel economy of an engine?

Line Plots

  • To visualize TRENDS w/ respect to an INDEPENDENT QUANTITY (e.g. time)

  • Example: Changes in atmospheric CO2 levels over the last 40 years.

Bar Plots (there’s SPACING BETWEEN BARS)

  • To visualize COMPARISON of AMOUNTS

  • Example: Number of students in each faculty (sci, bus, med, arts)

Histograms (NO space between bars)

  • To visualize DISTRIBUTION OF SINGLE QUANTITATIVE VARIABLE

  • Example: Differences in life expectancy across continents in 2016.

Rules of Thumb for Effective Visualizations

  • Avoid using tables, pie charts, and 3D visualizations for clarity and simplicity

  • Use color palettes that are simple and colorblind-friendly

  • A great visualization conveys its story without requiring additional explanation

Refining Visualizations

Convey message, minimize noise

  • use legends and labels

  • ensure text is big enough to read and clearly visible

  • be wary of overplotting

Explaining the Visualization

  • Presented visualizations should form a narrative:

    1. Establish the setting and scope.

    2. Pose the question.

    3. Use the visualization to answer the question.

    4. Summarize findings and motivate further discussion.

Saving the Visualization

Choose appropriate output formats based on needs:

  • Raster Formats: stored as a 2D grid of pixels, often compressed before storing so it takes less space

    • JPEG (.jpg, .jpeg): lossy, usually for photos

    • PNG (.png): lossless, usually for PLOTS / LINE drawings

    • BMP (.bmp): LOSSLESS, RAW IMAGE data, no compression (rarely used)

    • TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing

  • Vector Formats: collection of mathematical objects (lines, surfaces, shapes, curves)

    • SVG (.svg): general purpose use

    • EPS (.eps): general purpose use, infinitely scalable, great for presentations

Function Reference

  • geom_point(): To make a scatterplot.

  • geom_line(): To make a line plot.

  • geom_histogram(): To make a histogram.

  • geom_bar(): To make a bar graph.

  • geom_vline(): Adds a VERTICAL line to a plot, to INDICATE a SPECIFIC X-AXIS value, e.g. a threshold or average.

  • geom_hline(): Adds a HORIZONTAL line to a plot, to INDICATE a SPECIFIC Y-AXIS value, e.g. a threshold or average.

  • aes(): To create an aesthetic mapping.

  • geom_*: To create a geometric object.

  • xlim and ylim: Set limits for the x-axis/y-axis of a plot, controlling the range of values displayed.

  • fill: In an aes mapping, fills in BARS with a specific colour or separates counts by a variable aside from the x/y axis.

  • dodge: position = "dodge" prevents the chart from being stacked; PRESERVES the VERTICAL position of the plot while adjusting the horizontal position.

  • color: To distinguish different groups/categories by colour.

  • shape: Allows further visualization by varying data points by shape.

  • labs(): For labels and axis titles.

  • facet_grid(): To create a plot that has MULTIPLE SUBPLOTS arranged in a grid.

    • e.g. facet_grid(rows = vars(column))

    • rows argument if the plot is to be split HORIZONTALLY

    • cols argument if the plot is to be split VERTICALLY

    • vars() to correctly evaluate column names

  • ggsave(): To save plots in various formats.

  • \n: To create a line break in axis names.

  • scale_fill_brewer(palette = "..."): For the fill aes w/ a CATEGORICAL variable.

  • scale_fill_distiller(palette = "..."): For the fill aes w/ a NUMERIC variable.

  • scale_x_log10() / scale_y_log10(): To apply logarithmic scaling when AXES have LARGE numbers; also FORMAT AXIS LABELS to increase readability.

    • e.g. scale_x_log10(labels = label_comma())

  • order_by, n: Arguments to slice_max()/slice_min(); order_by is the column we want to use for comparing, n is how many rows to keep.

    • e.g. islands_top12 <- slice_max(islands_df, order_by = size, n = 12)

  • fct_reorder(): In an aes mapping, to reorder factors based on the values of another variable.

    • e.g. y = fct_reorder(landmass, size, .desc = TRUE)

  • as_factor(): Convert variables to the factor type.

    • Factor: data type used to represent categories.

Other

  • add alpha = 0.2 to geom_point to enhance visibility when points overlap

  • theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical")

  • options(repr.plot.width = 8, repr.plot.height = 8) to set plot size

  • geom_bar(stat = "identity") to DISPLAY VALUES in data frame AS IS

  • geom_bar(position = "identity") to DISPLAY BARS exactly as they are w/o any stacking or dodging

  • geom_vline(xintercept = 792.458, linetype = "dashed") to make certain lines stand out

  • geom_histogram(bins, binwidth)


Week 5: Version Control

What’s Version Control?

  • Version Control: the process of keeping a record of changes to documents, indicating changes made, and who made them

    • View earlier versions and revert changes

    • Facilitates resolving conflicts in edits.

    • Originally for software development, but now used for many tasks

Why Do We Need Collaboration Tools?

  • Initial approach: Sending files to teammates via email.

    • Problems:

      • Unclear versioning and edit tracking.

      • No insight into who made edits or when.

      • Difficulty in reverting to prior versions.

      • Poor communication regarding tasks and issues.

  • Alternative approach: Sharing via Dropbox/Google Drive.

    • Limitations:

      • Simple version management but lacks:

        • Edit tracking (who, when).

        • Clarity on project status after extended periods.

        • Easy means to revert changes.

        • Organized discussions for tasks/issues.

        • Complicated file naming like "final_revision_v3_Oct2020.docx" instead of clear version history.

Git and GitHub

  • Git → Version Control Tool

    • Tracks files in a repository (folder that you tell Git to pay attention to)

    • Responsible for keeping track of changes, sharing files, and conflict resolution

    • Runs on local machines of all users

  • GitHub → Cloud Service, Online Repositories

    • Cloud service that hosts repositories

    • Manages permissions (view/edit access)

    • Provides tools for project-specific communication (issues tracking)

    • Can build and host websites/blogs

Version Control Repositories

Two copies of the repository are typically created:

  • Local Repository: Personal working copy on OWN COMPUTER

  • Remote Repository: Copy of project hosted on server ONLINE, for sharing w/ others

  • Both copies maintain a full project history with commits (snapshots of project files at specific times).

  • Commits include a message and a unique hash as identifiers.

Key Concepts and Commands

  • Overview of important files and areas:

    • Working Directory: Files like analysis.ipynb, notes.txt, and README.md on the local computer.

    • Staging Area: Where changes are prepared for a commit (.git).

    • Remote Repository: GitHub stores your files.

Version Control Workflows

Committing changes: Save your changes in the local repo

  1. Working Directory: Contains files that may have pending changes

  2. Git Add: command to stage selected files to prepare for committing

    • Typical files to stage: analysis.ipynb, README.md.

  3. Git Commit: takes snapshot of files in the staging area, recording changes

    • Includes message to describe the nature of the changes

Pushing changes: Share commits with remote repo

  1. Git Push: UPLOAD UPDATES to the REMOTE repo on GitHub

  • Keeps COLLABORATORS SYNCHRONIZED w/ latest changes

  • When want to back up work

Pulling changes: COLLECTING NEW CHANGES others added on remote repo to your local machine

  • Git Pull: download file to computer

  • To SYNCHRONIZE LOCAL repo w/ changes made by collaborators on the remote repo

Cloning a repo: copy/download ENTIRE CONTENT (files, project, history, location of repo) of a REMOTE server repo to LOCAL MACHINE

  • Git Clone: creates connection between local and remote repo

  • To GET a WORKING COPY of repo

Other

  • Git has distinct step of adding files to staging area:

    • Not all changes made are ones we want to push to remote GitHub repo

    • Allows us to edit multiple files at once

    • Associate particular commit message w/ particular files (so they specifically reflect changes made)

  • GitHub Issues: ideal for project-specific communication (tasks, questions, bugs)

    • Issues persist even after they are closed (they are archived, not deleted), so past discussions remain traceable

  • CAN CLONE/PULL from any PUBLIC remote repo on GitHub, BUT can only push to public remote repo YOU OWN or have ACCESS to

  • Editing files on GitHub w/ pen tool: reserved for small edits to plain text files

  • Generate GitHub personal access token to authenticate when sending/retrieving work

  • Good practice to pull changes at start of every work session before working on local copy

  • Handling merge conflicts: open offending file in plain text editor & identify conflict markers (<<<<<<<, =======, >>>>>>>)

    • Manually edit file to resolve the conflicts, ensuring to keep the necessary changes from both versions before saving & committing resolved file


Week 6: Classification 1 - Using K-Nearest Neighbours

Classification Concept

  • Example: Diagnosing cancer tumor cells

    • Labels: "benign" or "malignant"

    • Questions arise about new cases based on features (e.g., Concavity and Perimeter)

    • Classification methods aim to answer questions regarding labels based on data

K-Nearest Neighbours (KNN) Classification

Predict label/class for a new observation using K closest points from dataset

  1. Compute distance between new observation and each observation in training set, to find K nearest neighbours

    • can go beyond 2 predictors

  2. Sort data in ascending order according to distances

  3. Choose the top K rows as “neighbors”

  4. Classifying new observation based on majority vote
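A "by hand" sketch of these steps with two predictors, assuming a training set cancer_train with columns Perimeter, Concavity, and Class (names assumed for illustration):

    new_obs <- tibble(Perimeter = 0.5, Concavity = 3.0)   # hypothetical new observation
    k <- 5

    cancer_train |>
      mutate(dist = sqrt((Perimeter - new_obs$Perimeter)^2 +      # 1. straight-line distance
                         (Concavity - new_obs$Concavity)^2)) |>
      slice_min(dist, n = k) |>                                    # 2./3. the K closest rows
      count(Class) |>                                              # 4. majority vote
      slice_max(order_by = n, n = 1)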

Tidymodels Package in R

  • tidymodels: Collection of packages that handles computing distances, standardization, balancing, and prediction

  1. Load the libraries and data

  • Necessary libraries include tidyverse and tidymodels

  • Import data and mutate as necessary

  2. Make a recipe to specify predictors/response and preprocess data

  • recipe(): Main argument is the formula

    • Arguments: formula and data

  • prep() & bake()

  3. Build model specification (model_spec) to specify model and training algorithm

    • model type: kind of model you want to fit

    • arguments: model parameter values

    • engine: underlying package the model should come from

    • mode: type of prediction

  4. Put them together in a workflow and then fit it

  5. Predict a new class label using predict() (see the sketch below)
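A sketch of steps 2-5 with tidymodels, assuming a training set cancer_train with label Class and predictors Perimeter and Concavity (names assumed):

    library(tidymodels)

    knn_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>   # step 2
      step_scale(all_predictors()) |>
      step_center(all_predictors())

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>   # step 3
      set_engine("kknn") |>
      set_mode("classification")

    knn_fit <- workflow() |>                                                      # step 4
      add_recipe(knn_recipe) |>
      add_model(knn_spec) |>
      fit(data = cancer_train)

    new_obs <- tibble(Perimeter = 0.5, Concavity = 3.0)
    predict(knn_fit, new_obs)                                                     # step 5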

Importance of Standardization

Standardization: important to account for all features in distance calculation, enhances prediction accuracy

  • Adjusting for center and spread

  • Shift and scale the data so each predictor has a mean of 0 and a standard deviation of 1

Non-standardized Data:

  • Issues arise if one variable's scale is significantly larger than others
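A sketch of standardizing all predictors with a recipe, then materializing the result with prep() and bake() (data frame and column names assumed from the example above):

    standardized <- recipe(Class ~ ., data = cancer_train) |>
      step_center(all_predictors()) |>   # shift each predictor to mean 0
      step_scale(all_predictors()) |>    # scale each predictor to sd 1
      prep() |>
      bake(new_data = NULL)              # NULL returns the preprocessed training data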

Function Reference

  • recipe(): To prepare data for a modelling task.

  • step_center(): Centers numeric variables (subtracts the mean), part of standardizing the data.

  • step_scale(): Scales numeric variables (divides by the standard deviation), part of standardizing the data.

  • prep(): Prepares the recipe for modelling by estimating the required values from the training data.

  • bake(): Applies the preprocessing steps to new data.


Week 7: Classification 2 - Evaluating & Tuning

Randomness and Seeds

set.seed() EXACTLY ONCE AT BEGINNING OF EVERY ANALYSIS

Randomness: ensures the analysis is fair, unbiased, and not influenced by human input; setting a seed makes the random steps reproducible

  • Use Random Seeds: Ensures consistent results across iterations of analysis in R

  • Call set.seed() to initialize the random number generator for consistent outputs.

Evaluating Performance

Key Concept: A classifier doesn't need 100% accuracy but must minimize wrong predictions

  • Importance of evaluating classifiers on new data (test set) not seen during training

  • Split data: Use initial_split() to separate into training (75%) and testing (25%).

  • Golden rule: Do not use test data for model building

Example: Breast Cancer Data

  • Predict benign vs malignant tumors from biopsy images using a classifier.

Measure accuracy:

Accuracy and Confusion Matrix

  • Accuracy: correctness overall

  • Confusion Matrix: shows how many test set labels of each type are predicted correctly and incorrectly

    • Example confusion matrix:

      • True Positive (TP): 1

      • False Positive (FP): 3

      • True Negative (TN): 57

      • False Negative (FN): 4

  • Performance metrics based on confusion matrix:

    • Accuracy: 89%

    • Precision: 25%

    • Recall: 20%

  • Positive label: the label we are more interested in identifying

Precision: quantifies how many of the positive predictions the classifier made were actually positive, i.e. TP / (TP + FP)

Recall: whether we found all the relevant items; quantifies how many of the positive observations in the test set were identified as positive, i.e. TP / (TP + FN)
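Worked out from the example confusion matrix above:

    tp <- 1; fp <- 3; tn <- 57; fn <- 4

    accuracy  <- (tp + tn) / (tp + fp + tn + fn)   # 58 / 65 ≈ 0.89
    precision <- tp / (tp + fp)                    # 1 / 4  = 0.25
    recall    <- tp / (tp + fn)                    # 1 / 5  = 0.20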

Evaluating Performance with tidymodels

Workflow using tidymodels

  • Load necessary packages and set the random seed.

  • Split data: Use initial_split() to separate into training (75%) and testing (25%)

  • Preprocess data using recipe for standardization (important for K-NN accuracy) USING ONLY TRAINING DATA

  • Train Classifier: Specify model parameters, fit K-NN model on training data.

  • Predict on Test Set: Assess model performance by predicting labels for the test set

  • Performance Metrics: Use metrics() to evaluate accuracy; precision and recall come from precision() and recall() (see the sketch below)
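A sketch of this workflow, assuming a cancer data frame with label column Class and a fitted knn_fit as in Week 6:

    set.seed(1)

    cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
    cancer_train <- training(cancer_split)
    cancer_test  <- testing(cancer_split)

    # ... build the recipe / model spec / workflow and fit knn_fit on cancer_train ...

    predict(knn_fit, cancer_test) |>
      bind_cols(cancer_test) |>                          # add predictions to the test set
      metrics(truth = Class, estimate = .pred_class)     # accuracy (and kappa)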

Critical Evaluation of Performance

  • Context of application defines acceptable accuracy and trade-offs between precision and recall.

  • Always compare classifier performance against a simple majority classifier

Tuning the Classifier

  • Importance of K parameter in K-NN and its impact on model performance.

  • Cross-Validation: Repeat process to evaluate accuracy using different splits of data.

  • Parameter Grid: Specify values to optimize during tuning and establish best-performing parameters.
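A sketch of 5-fold cross-validation over a grid of K values (data frame, recipe, and column names assumed from the examples above):

    cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

    k_vals <- tibble(neighbors = seq(1, 15, by = 2))   # parameter grid to try

    knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

    knn_results <- workflow() |>
      add_recipe(knn_recipe) |>
      add_model(knn_tune_spec) |>
      tune_grid(resamples = cancer_vfold, grid = k_vals) |>
      collect_metrics()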

Under/Overfitting Concepts

  • Underfitting: High K gives too little model complexity; the model oversimplifies and loses predictive capacity.

  • Overfitting: Low K considers too much noise; less reliable predictions on new data.

Predictor Variable Selection

Effects of Irrelevant Predictors

  • Inclusion of irrelevant predictors harms model accuracy by introducing noise

  • Utilize systematic methods for variable selection by removing non-informative variables

Forward Selection Method

  • Iterative approach to build models with increasing sets of predictors.

  • Allows focusing on maintaining accuracy while minimizing predictor variables.

Function Reference

  • set.seed(): Ensures consistent results from random processes in analyses.

  • sample(): To randomly sample values.

    • random_num <- sample(0:9, 10, replace = TRUE)

  • initial_split(): To split a dataset into training and testing sets.

  • metrics(): Calculates performance metrics for a model's predictions, to assess how well the model performs on the test data.

  • glimpse(): To view data with a large number of columns.

  • bind_cols(): To add a column of predictions to the original data set.

  • tune_grid(): To fit the model for each value in a range of parameter values.
