Descriptive: Summarizes characteristics of a data set without interpretation.
Example: "How many people live in each province and territory in Canada?"
Exploratory: Looks for patterns, trends, or relationships in a single data set.
Example: "Does political party voting patterns change with indicators of wealth in a set of data collected on 2,000 people living in Canada?"
Predictive: Focuses on predicting outcomes based on existing data.
Example: "What political party will someone vote for in the next Canadian election?"
Inferential: Examines if findings from a single data set can apply to the wider population.
Example: "Does political party voting change with indicators of wealth for all people living in Canada?"
Summarization: Computes and reports aggregated values.
Example Question: "What is the average race time for runners in this data set?"
Visualization: Graphically plots data for interpretation.
Example Question: "Is there any relationship between race time and age for runners in this data set?"
Classification: Predicts a class/category for new observations.
Example Question: "Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?"
Regression: Predicts a quantitative value for new observations.
Example Question: "What will be the race time for a 20-year-old runner who weighs 50kg?"
Clustering: Finds unknown/unlabelled subgroups in a dataset.
Example Question: "What products are commonly bought together on Amazon?"
Estimation: Using measurements or averages from a smaller group (a sample) to estimate a value for a larger population.
Example Question: "Given a survey of cellphone ownership of 100 Canadians, what proportion of the entire Canadian population own Android phones?"
Data Science: use of reproducible + auditable processes to get value from data
Reproducible: easily repeated by others
Auditable: easily traced & critiqued
Data Set: Structured collection of numbers and characters.
Rows: Represent observations (horizontal).
Columns: Represent variables (vertical).
Data Frame: a data set organized so that each row is a single observation and each column is a variable whose values are all of one type.
Loading the tidyverse in R can produce messages about attached packages and potential conflicts.
Users can access a specific version of a function from a loaded package by using the package prefix (e.g., dplyr::filter()).
Use lowercase letters, numbers, and underscore (_).
Avoid using spaces or special characters.
R is case-sensitive.
Use meaningful variable names for better readability of scripts.
The filter() function: obtain a subset of ROWS with specific values
Example: filter(data_frame, logical_statement) returns a new data frame containing only the rows where the logical statement evaluates to TRUE.
The select() function extracts specific COLUMNS from the data frame.
The mutate() function adds/modifies columns in a data frame, allowing for new calculations or transformations of existing data (see the sketch below).
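A minimal sketch of these three verbs, assuming a hypothetical data frame race_data with columns name, age, and time_minutes (these names and values are invented for illustration):
library(tidyverse)
race_data <- tibble(name = c("Ana", "Ben", "Cleo"),
                    age = c(20, 35, 28),
                    time_minutes = c(42, 55, 47))
fast_runners <- filter(race_data, time_minutes < 50)            # keep ROWS where the condition is TRUE
times_only <- select(race_data, name, time_minutes)             # keep only these COLUMNS
race_data <- mutate(race_data, time_hours = time_minutes / 60)  # add a new column computed from an existing one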
ggplot() is a function used for creating visualizations in R, with layering concepts similar to how one builds a plot manually.
Advantages of using R:
statistical analysis functions go beyond Excel
free & open-source
transparent & reproducible code
can handle large amounts of data + complex analyses
Correlation does not equal causation: Even if a correlation existed, it wouldn't necessarily mean that one variable causes the other. There could be other factors influencing both variables.
More data and analysis: To make a more definitive conclusion, it would be helpful to have more data points and conduct statistical tests to quantify the strength and significance of any potential correlation
Path: where the file lives on computer, the directions to the file
Local (computer):
Absolute path: locates file with respect to the “root” folder on computer
full path from the root folder to the file's location on the filesystem
STARTS with /
e.g. /home/instructor/documents/timesheet.xlsx
Relative path: locates a file relative to your working directory
DOESN’T start with /
e.g. documents/timesheet.xlsx
Working directory: current folder/location containing the file we are currently working on, serving as reference point for relative paths
Format DOESN’T include the working directory
Used to navigate files efficiently & to ensure code can be run on a different computer
Remote (on web):
Uniform Resource Locator (URL):A specific type of address used to access resources on the internet, which includes the protocol, domain name, and path to the resource.
http:// or https://
assign it to object named url to use it:
url <- "https://raw.githubusercontent.com/UBC_DCSI/data/main/can_lang.csv"
canlang_data <- read_csv("url")
canlang_data
Ex. if absolute path looks like: /Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx
and working directory is UBC
, relative path: BIOL363/SciaticNerveLab/sn_trial_1.xlsx
Navigate file, see what it looks like: column names? delimiters? lines to skip?
Prepare to load into R
skim data, might need to modify, look at how values are separated
might need to load a library
, download file, or connect to database
Load into R: check with reading data function
Inspect the result CAREFULLY to make sure it worked: to reduce bugs + speed up
read_csv(): To read CSV files (commas as the delimiter)
read_tsv(): To read tab-separated files (tabs between the columns)
read_delim(): For flexibility in choosing different delimiters, can import both CSV and TSV files, but just SPECIFY WHICH DELIMITER during import
1. specify path to file as first argument
2. provide the tab character "\t" as the delim argument
3. set the col_names argument to FALSE to say there are NO COLUMN NAMES GIVEN in the data; if TRUE, the first row will be treated as column names
e.g. canlang_data <- read_delim("data/can_lang_no_names.tsv", delim = "\t", col_names = FALSE)
read_excel(): For reading Excel spreadsheets, load the readxl package first
sheet argument to specify sheet # or name (when the file has multiple sheets)
range argument to specify cell ranges (see the sketch after this list)
skip: to ignore non-crucial information at the top of the data files when loading, so data can be read properly
1. Look at data first to see HOW MANY LINES WE DON’T NEED (e.g. if columns start at line 4, skip the first 3 lines)
canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)
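A hedged sketch of read_excel() with the sheet and range arguments; the file name, sheet name, and cell range below are made up for illustration:
library(readxl)
# read only the sheet named "2016" and only the rectangle of cells A1:D10 (hypothetical file/sheet/range)
canlang_data <- read_excel("data/can_lang.xlsx", sheet = "2016", range = "A1:D10")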
Connect to databases using dbConnect() and run SQL queries through R without needing extensive SQL knowledge, thanks to packages like dbplyr.
Use collect() to bring data from a database into R as a data frame for local analysis.
We DIDN’T get a data frame from the database, but it’s a REFERENCE (data still stored in SQLite database)
In order to open a database in R, you need to take the following steps:
Connect to the database using the dbConnect function.
library(DBI) #when using SQLite Database
canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
library(RPostgres) #when using PostgreSQL Database
canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db", host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123")
For PostgreSQL, there’s additional info needed to include:
host: URL pointing to where the database is located
port: communication endpoint btwn R and the PostgreSQL database
Check what tables (similar to R data frames, Excel spreadsheets) are in the database using the dbListTables function
Once you've picked a table, create an R object for it using the tbl function
Make sure you filter & select the database table before using collect (see the sketch after this list)
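A minimal sketch of that workflow, continuing the SQLite connection above; the table name "lang" and the column names used here are assumptions for illustration:
library(DBI)
library(dbplyr)
library(tidyverse)
canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
dbListTables(canlang_conn)                                             # list the tables in the database
lang_db <- tbl(canlang_conn, "lang")                                   # a REFERENCE to the table (data stays in the database)
aboriginal_db <- filter(lang_db, category == "Aboriginal languages")   # hypothetical column/value
aboriginal_db <- select(aboriginal_db, language, mother_tongue)        # narrow down before collecting
aboriginal_data <- collect(aboriginal_db)                              # NOW it becomes a real data frame in R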
Use write_csv() to save processed data frames back to CSV format after analysis.
Strings in R: sequences of characters used to represent text data; they can be created using single or double quotes.
Function | Description |
read_csv() | To read CSV files. |
read_csv2() | To read CSV files, but uses ; for field separators and , for decimal pts |
read_tsv() | To read tab-separated files. |
read_delim() | For flexibility in choosing different delimiters. |
read_excel() | For reading Excel spreadsheets. |
write_csv() | To save processed data frames back to CSV format. |
dbConnect() | Connect to databases. |
collect() | Bring data selected from a database into R as a data frame. |
filter() | To obtain a subset of ROWS w/ specific values; if there are multiple CONDITIONS, you can JOIN THEM USING COMMA or & |
select() | Extract specific COLUMNS from the data frame. |
mutate() | Adds or modifies columns in a data frame. |
ggplot() | Function used for creating visualizations in R. |
ggtitle() | Adds specified title to the plot |
library() | To load a package into R, to use its functions and datasets |
# | For comments, R will ignore any text after the symbol |
rename() | To rename columns, use subsequent arguments |
download.file() | Use when data at a URL is not formatted nicely, so a read_* function can be used on the downloaded copy; 1st argument: url, 2nd: path where we want to store the downloaded file |
dbListTables() | To list all names of tables in a database |
tbl() | To reference a table in the database (allows us to perform operations like selecting columns etc.) |
| To count how many times a specific value appears in a column |
sheet, range | read_excel() arguments to specify the sheet # or name, and to specify cell ranges |
path | Used to specify the location of a file on the computer or on the web when reading data into R; the format varies depending on whether it's a local or remote file; when it's a URL, assign it to an object first |
| Replaces spaces w/ underscores & converts to lowercase to make names a standard format |
arrange() | Order rows by the values of the given columns (default is increasing) |
theme() | Used to modify the non-data components of the plot with specified options |
slice() | SELECTS ROWS according to their row number, keeps rows in the given range, e.g. rows 1 to 10 |
slice_max() | Keep the n rows with the LARGEST values of a variable |
slice_min() | Keep the n rows with the SMALLEST values of a variable |
head() | To preview the first 6 rows of a data frame |
tail() | To preview the last 6 rows of a data frame |
n() | Counts the number of rows in a table's group, e.g. a column has names of restaurants and we want to find the count for each restaurant name |
group_by() | Organizes rows into groups based on the values of a second variable |
? | To pull up documentation for most functions |
Real-World Data Issues:
Data is often messy:
Inconsistent formats (e.g., commas, tabs, semicolons).
Missing data and extra empty lines.
Split into multiple files (e.g., yearly data).
Corrupted files and custom formats.
Even after loading data in R, it may remain messy.
Key Point: Make data "tidy"
Tidy data enhances: Reproducibility, Efficiency, Collaboration
Tidy datasets give each variable its own column and each observation its own row; messy datasets often cram multiple variables or values into a single column.
Vectors: objects that contain ordered collections of elements of the SAME DATA TYPE
Create using the c() function (see the sketch after this list)
Lists: Ordered collections that can contain MIXED TYPES of elements.
Data Frame as a Special List:
Variables should be in columns
Each row should represent a single observation
Each cell is single measurement
Easier manipulation, plotting, and analysis in consistent formats.
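A small sketch contrasting these structures; the names and values are invented for illustration:
library(tidyverse)
ages <- c(20, 35, 28)                      # vector: ordered, all elements the SAME type (numeric)
mixed <- list("Ana", 20, TRUE)             # list: ordered, can MIX types
people <- tibble(name = c("Ana", "Ben"),   # data frame: a special list of equal-length column vectors;
                 age = c(20, 35))          # each row is one observation, each cell one measurement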
Different representations of tuberculosis data are analyzed for tidiness.
Analysis Results:
Example structures of data:
Tidy Data:
Country
, Year
, Cases
, Population
Complies with tidy data principles (each variable in its own column).
Not Tidy Data:
Columns with non-specific names or multiple values in one column (e.g., year grouped with cases and population).
Requires restructuring (e.g. with pivot_longer) to be considered tidy (see the sketch below)
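A hedged sketch of restructuring a wide table into tidy form with pivot_longer(); the tuberculosis-style columns (`1999`, `2000`) and values are invented for illustration:
library(tidyverse)
untidy <- tibble(country = c("Canada", "Brazil"),
                 `1999` = c(100, 200),     # the year is hidden in the column NAMES (not tidy)
                 `2000` = c(110, 215))
tidy <- pivot_longer(untidy,
                     cols = `1999`:`2000`, # columns to collapse
                     names_to = "year",    # old column names become a new variable
                     values_to = "cases")  # old cell values become a new variable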
Focus on tidyverse package functions:
dplyr functions: select, filter, mutate, group_by, summarize
tidyr functions: pivot_longer, pivot_wider
purrr function: map_dfr
Examples using library functions demonstrated with the penguins dataset.
Key functions:
Select: Choose columns from a data frame.
Filter: Subset rows based on conditions (e.g., flipper length).
Mutate: Transform existing columns or create new columns (e.g., mass conversion).
Discussed methods to handle multiple operations:
Saving Intermediate Objects:
Drawbacks include complexity and potential memory issues.
Composing Functions:
Disadvantage: Less readable due to inner function execution first.
Using Pipes (|>):
Enhances readability and avoids intermediate variables.
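A sketch of the three styles using the penguins data frame from the palmerpenguins package (the columns species and body_mass_g come from that dataset; the 4000 g threshold is arbitrary):
library(tidyverse)
library(palmerpenguins)
# 1. intermediate objects
heavy <- filter(penguins, body_mass_g > 4000)
heavy_small <- select(heavy, species, body_mass_g)
# 2. composed functions (read inside-out, less readable)
heavy_small <- select(filter(penguins, body_mass_g > 4000), species, body_mass_g)
# 3. pipe (reads left to right, no intermediate names)
heavy_small <- penguins |>
  filter(body_mass_g > 4000) |>
  select(species, body_mass_g)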
group_by() + summarize() Concepts:
Grouping data based on COLUMN VALUES.
Summarizing is when you COMBINE DATA into fewer summary values (e.g., average pollution).
Applications to the penguins dataset: average mass by species.
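A minimal sketch of the "average mass by species" computation (penguins and its columns come from palmerpenguins; na.rm handles the missing masses):
library(tidyverse)
library(palmerpenguins)
penguins |>
  group_by(species) |>                                    # one group per species
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))  # collapse each group to a single summary row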
Iteration Defined:
Executing a process multiple times (e.g., in data manipulation).
map_dfr Functionality:
Example of iterating over multiple columns in the penguins dataset (see the sketch after this list).
Caution on the multiplicity of map_ variants available.
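A hedged sketch of map_dfr() applied over several numeric columns of penguins; select() comes first because map_dfr has no argument to choose columns:
library(tidyverse)
library(palmerpenguins)
penguins |>
  select(bill_length_mm, flipper_length_mm, body_mass_g) |>  # pick the columns to iterate over
  map_dfr(mean, na.rm = TRUE)                                # apply mean to each column, bind results row-wise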
na.rm in R: argument in functions to handle missing values (NA).
Purpose: Specify whether to remove NA values before computation.
Usage: Set to TRUE to ignore NAs, allowing calculations on available data.
Example: mean(x, na.rm = TRUE) calculates the mean while excluding NAs.
Importance: Useful for accurate analysis in datasets with missing values.
Operators for Data Wrangling:
Comparison:
==: Extracting rows w/ a certain value; checks for equality btwn two values, returning TRUE if equal and FALSE otherwise
!=: Extracting rows that DON’T have a certain value; means NOT EQUAL TO
<: Look for values BELOW a threshold
<=: Look for values EQUAL TO OR BELOW a threshold
>: Look for values ABOVE a threshold
>=: Look for values EQUAL TO OR ABOVE a threshold
Logical:
%in%: Extracting rows w/ values in a vector; used to see if a value belongs to a vector
!: Means NOT, changing TRUE to FALSE and FALSE to TRUE
&: Extracting rows satisfying multiple conditions, works just like the comma
|: Vertical pipe to give cases where one condition or another or both are satisfied
e.g. filter(official_langs, region == "Calgary" | region == "Edmonton")
%>%: sends the result of one function to the next function
Function | Description |
pivot_longer() | Transform data FROM WIDE TO LONG FORMAT |
pivot_wider() | Convert data FROM LONG TO WIDE FORMAT |
separate() | Split a single column into multiple columns by delimiter. |
summarize() | Generate summary statistics on columns |
map_dfr() | Apply a function to each element in a list/vector & return the results as a single data frame ("dfr" is for data frames, combining row-wise); DOESN'T have an argument to specify which columns to apply the function to, so use select beforehand |
group_by() | To group by one or more variables |
across() | Apply functions to MULTIPLE COLUMNS simultaneously |
rowwise() | Applies functions across COLUMNS WITHIN ONE ROW |
c() | To CREATE VECTORS/COLUMNS (concatenate) |
semi_join() | Filtering data frames based on matches in another data frame |
as.numeric() | To convert data to numeric format |
| To have plots side by side for comparison |
Questions driving visualization: Each visualization should aim to answer a specific question from the dataset.
Clarity and Greatness: A good visualization answers a question clearly, while a great one hints at the question itself.
Types of questions answered through visualizations:
Descriptive: E.g., What are the largest 7 landmasses on Earth?
Exploratory: E.g., Is there a relationship between penguin body mass and bill length?
Other types: inferential, predictive, causal, mechanistic.
ggplot2: Key components include:
Aesthetic Mappings (aes): map data frame columns to visual properties
Geometric Objects (geom): encode HOW TO DISPLAY those visual properties
Scales: transform variables, set limits
Elements are combined using the + operator (see the sketch below).
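A minimal ggplot sketch combining these pieces (the penguins columns again come from palmerpenguins; the labels are arbitrary):
library(tidyverse)
library(palmerpenguins)
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, color = species)) +  # aesthetic mappings
  geom_point(alpha = 0.5) +                                                    # geometric object
  labs(x = "Body mass (g)", y = "Bill length (mm)", color = "Species")         # readable labels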
Categorical Variables: can be divided into groups (categories) e.g. marital status
Quantitative Variables: measured on numeric scale (e.g. height)
Scatter plot: To visualize the relationship between 2 QUANTITATIVE variables
Example Question: How does horsepower affect the fuel economy of an engine?
Line plot: To visualize TRENDS w/ respect to an INDEPENDENT QUANTITY (e.g. time)
Example: Changes in atmospheric CO2 levels over the last 40 years.
Bar plot: To visualize COMPARISON of AMOUNTS
Example: Number of students in each faculty (sci, bus, med, arts)
Histogram: To visualize the DISTRIBUTION OF A SINGLE QUANTITATIVE VARIABLE
Example: Differences in life expectancy across continents in 2016.
Avoid using tables, pie charts, and 3D visualizations for clarity and simplicity
Use color palettes that are simple and colorblind-friendly
A great visualization conveys its story without requiring additional explanation
Convey message, minimize noise
use legends and labels
ensure text is big enough to read and clearly visible
be wary of overplotting
Presented visualizations should form a narrative:
Establish the setting and scope.
Pose the question.
Use the visualization to answer the question.
Summarize findings and motivate further discussion.
Choose appropriate output formats based on needs:
Raster Formats: 2D grid of pixels, often compressed before storing so it takes less space
JPEG (.jpg, .jpeg): lossy, usually for photos
PNG (.png): lossless, usually for PLOTS / LINE drawings
BMP (.bmp): LOSSLESS, RAW IMAGE data, no compression (rarely used)
TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing
Vector Formats: collection of mathematical objects (lines, surfaces, shapes, curves)
SVG (.svg): general purpose use
EPS (.eps): general purpose use, infinitely scalable, great for presentations
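A short hedged sketch of saving the same plot in a raster and a vector format with ggsave(); the plot object and file names are placeholders:
library(tidyverse)
my_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()  # any ggplot object
ggsave("my_plot.png", my_plot, width = 6, height = 4)           # raster (lossless PNG)
ggsave("my_plot.svg", my_plot, width = 6, height = 4)           # vector (needs the svglite package installed)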
Function | Description |
geom_point() | To make a scatterplot |
geom_line() | To make a line plot |
geom_histogram() | To make a histogram |
geom_bar() | To make a bar graph |
geom_vline() | Adds a VERTICAL line to a plot, to INDICATE a SPECIFIC X-AXIS value e.g. threshold, average |
geom_hline() | Adds a HORIZONTAL line to a plot, to INDICATE a SPECIFIC Y-AXIS value e.g. threshold, average |
aes() | To create an aesthetic mapping |
geom_*() | To create a geometric object |
xlim() / ylim() | Sets limits for the x-axis/y-axis in a plot, controls the range of values displayed |
fill | For aes mapping, fills in BARS by a specific color or separates counts by a variable aside from the x/y axis |
position = "dodge" | Prevents the chart from being stacked, PRESERVES the VERTICAL position of bars while adjusting the horizontal position |
color | To distinguish different groups/categories by color |
shape | Allows further visualization by varying data points by shape |
labs() | For labels, axes |
facet_grid() | To create a plot that has MULTIPLE SUBPLOTS arranged in a grid |
ggsave() | To save plots in various formats |
\n | To create a line break in axis names |
scale_fill_brewer() | For fill aes w/ a CATEGORICAL variable |
scale_fill_distiller() | For fill aes w/ a NUMERIC variable |
scale_x_log10() / scale_y_log10() | To accomplish logarithmic scaling when AXES have LARGE numbers, also FORMAT AXIS LABELS to increase readability |
| Name of the column we want to use for comparing |
fct_reorder() | In aes mapping, to reorder factors based on values of another variable |
as_factor() | Convert variables to factor type (Factor: data type used to represent categories) |
add alpha = 0.2 to geom_point to enhance visibility when points overlap
theme(text = element_text(size = 12), legend.position = "top", legend.direction = "vertical")
options(repr.plot.width = 8, repr.plot.height = 8) to set plot size
geom_bar(stat = "identity") to DISPLAY VALUES in the data frame AS IS
geom_bar(position = "identity") to DISPLAY BARS exactly as they are w/o any stacking or dodging
geom_vline(xintercept = 792.458, linetype = "dashed") to make certain lines stand out
geom_histogram(bins = ..., binwidth = ...) to control the number or width of histogram bins
Version Control: the process of keeping a record of changes to documents, indicating changes made, and who made them
View earlier versions and revert changes
Facilitates resolving conflicts in edits.
Originally for software development, but now used for many tasks
Initial approach: Sending files to teammates via email.
Problems:
Unclear versioning and edit tracking.
No insight into who made edits or when.
Difficulty in reverting to prior versions.
Poor communication regarding tasks and issues.
Alternative approach: Sharing via Dropbox/Google Drive.
Limitations:
Simple version management but lacks:
Edit tracking (who, when).
Clarity on project status after extended periods.
Easy means to revert changes.
Organized discussions for tasks/issues.
Complicated file naming like "final_revision_v3_Oct2020.docx" instead of clear version history.
Git → Version Control Tool
Tracks files in a repository (folder that you tell Git to pay attention to)
Responsible for keeping track of changes, sharing files, and conflict resolution
Runs on local machines of all users
GitHub → Cloud Service, Online Repositories
Cloud service that hosts repositories
Manages permissions (view/edit access)
Provides tools for project-specific communication (issues tracking)
Can build and host websites/blogs
Two copies of the repository are typically created:
Local Repository: Personal working copy on OWN COMPUTER
Remote Repository: Copy of project hosted on server ONLINE, for sharing w/ others
Both copies maintain a full project history with commits (snapshots of project files at specific times).
Commits include a message and a unique hash as identifiers.
Overview of important files and areas:
Working Directory: Files like analysis.ipynb, notes.txt, and README.md on the local computer.
Staging Area: Where changes are prepared for a commit (.git).
Remote Repository: GitHub stores your files.
Committing changes: Save your changes in the local repo
Working Directory: Contains files that may have pending changes
Git Add: command to stage selected files to prepare for committing
Typical files to stage: analysis.ipynb, README.md.
Git Commit: takes snapshot of files in the staging area, recording changes
Includes message to describe the nature of the changes
Pushing changes: Share commits with remote repo
Git Push: UPLOAD UPDATES to the REMOTE repo on GitHub
Keeps COLLABORATORS SYNCHRONIZED w/ latest changes
When want to back up work
Pulling changes: COLLECTING NEW CHANGES others added on remote repo to your local machine
Git Pull: download file to computer
To SYNCHRONIZE LOCAL repo w/ changes made by collaborators on the remote repo
Cloning a repo: copy/download ENTIRE CONTENT (files, project, history, location of repo) of a REMOTE server repo to LOCAL MACHINE
Git Clone: creates connection between local and remote repo
To GET a WORKING COPY of repo
Git has distinct step of adding files to staging area:
Not all changes made are ones we want to push to remote GitHub repo
Allows us to edit multiple files at once
Associate particular commit message w/ particular files (so they specifically reflect changes made)
GitHub Issues: ideal for project-specific communication (better than email)
Issues persist even after they are closed; closed issues are archived, not deleted
CAN CLONE/PULL from any PUBLIC remote repo on GitHub, BUT can only push to public remote repo YOU OWN or have ACCESS to
Editing files on GitHub w/ pen tool: reserved for small edits to plain text files
Generate GitHub personal access token to authenticate when sending/retrieving work
Good practice to pull changes at start of every work session before working on local copy
Handling merge conflicts: open the offending file in a plain text editor & identify the conflict markers (<<<<<<<, =======, >>>>>>>)
Manually edit file to resolve the conflicts, ensuring to keep the necessary changes from both versions before saving & committing resolved file
Example: Diagnosing cancer tumor cells
Labels: "benign" or "malignant"
Questions arise about new cases based on features (e.g., Concavity and Perimeter)
Classification methods aim to answer questions regarding labels based on data
Predict label/class for a new observation using K closest points from dataset
Compute distance between new observation and each observation in training set, to find K nearest neighbours
can go beyond 2 predictors
Sort data in ascending order according to distances
Choose the top K rows as “neighbors”
Classify the new observation based on a majority vote of the neighbours (see the sketch after this list)
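A rough tidyverse sketch of these steps for K = 3 with two predictors; the data frame name cancer, its columns, and the new observation's values are assumptions for illustration (the tidymodels workflow below is the approach the notes actually use):
library(tidyverse)
# hypothetical new tumor with two standardized measurements
new_concavity <- 3.5
new_perimeter <- 0.5
neighbors <- cancer |>                                   # `cancer` assumed to have Concavity, Perimeter, Class
  mutate(dist = sqrt((Concavity - new_concavity)^2 +
                     (Perimeter - new_perimeter)^2)) |>  # distance from each observation to the new one
  arrange(dist) |>                                       # sort in ascending order of distance
  slice(1:3)                                             # keep the K = 3 nearest neighbours
count(neighbors, Class)                                  # majority vote gives the predicted label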
tidymodels: package in R; a collection of packages that handles computing distances, standardization, balancing, and prediction
Load the libraries and data
Necessary libraries include tidyverse
and tidymodels
Import data and mutate as necessary
Make a recipe to specify predictors/response and preprocess data
recipe(): main arguments are a formula (response ~ predictors) and the data
prep() & bake()
Build a model specification (model_spec) to specify the model and training algorithm
model type: kind of model you want to fit
arguments: model parameter values
engine: underlying package the model should come from
mode: type of prediction
Put them together in a workflow and then fit it
Predict a new class label using predict() (see the sketch below)
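A hedged end-to-end sketch of that workflow, assuming a cancer data frame with columns Class, Concavity, and Perimeter (names chosen to match the running example, not taken verbatim from the notes):
library(tidyverse)
library(tidymodels)
# recipe: specify response ~ predictors and standardize the predictors
knn_recipe <- recipe(Class ~ Concavity + Perimeter, data = cancer) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())
# model specification: model type, parameter values, engine, and mode
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")
# combine recipe and spec in a workflow, then fit
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer)
# predict the class for a hypothetical new observation
new_obs <- tibble(Concavity = 3.5, Perimeter = 0.5)
predict(knn_fit, new_obs)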
Standardization: important to account for all features in distance calculation, enhances prediction accuracy
Adjusting for center and spread
Shift and scale data to have a mean (average) of 0 and a standard deviation of 1
Non standardized Data:
Issues arise if one variable's scale is significantly larger than others
Function | Description |
recipe() | To prepare data for a modelling task |
step_center() | Centers numeric variables, to help standardize data |
step_scale() | Scales numeric variables, to help standardize data |
prep() | Prepares the recipe for modelling by estimating required values from the training data |
bake() | Applies the preprocessing steps to new data |
set.seed(): call EXACTLY ONCE AT THE BEGINNING OF EVERY ANALYSIS
Randomness: ensures the analysis is fair, unbiased, and not influenced by human choices; setting a seed keeps the random steps reproducible
Use Random Seeds: Ensures consistent results across iterations of analysis in R
Call set.seed() to initialize the random number generator for consistent outputs.
Key Concept: A classifier doesn't need 100% accuracy but must minimize wrong predictions
Importance of evaluating classifiers on new data (test set) not seen during training
Split data: Use initial_split() to separate into training (75%) and testing (25%).
Golden rule: Do not use test data for model building
Example: Breast Cancer Data
Predict benign vs malignant tumors from biopsy images using a classifier.
Measure accuracy:
Accuracy: correctness overall
Confusion Matrix: shows how many test set labels of each type are predicted correctly and incorrectly
Example confusion matrix:
True Positive (TP): 1
False Positive (FP): 3
True Negative (TN): 57
False Negative (FN): 4
Performance metrics based on the confusion matrix above:
Accuracy: 89%
Precision: 25%
Recall: 20%
Positive label: the label we are more interested in identifying
Precision: consistency, quantifies how many of positive predictions the classifier made were actually positive
Recall: found all the relevant items, quantifies how many of the positive observations in the test set were identified as positive
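A quick check of those numbers from the counts above (TP = 1, FP = 3, TN = 57, FN = 4):
tp <- 1; fp <- 3; tn <- 57; fn <- 4
accuracy  <- (tp + tn) / (tp + fp + tn + fn)  # 58 / 65 ≈ 0.89
precision <- tp / (tp + fp)                   # 1 / 4  = 0.25
recall    <- tp / (tp + fn)                   # 1 / 5  = 0.20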
Workflow using tidymodels
Load necessary packages and set the random seed.
Split data: Use initial_split() to separate into training (75%) and testing (25%)
Preprocess data using a recipe for standardization (important for K-NN accuracy) USING ONLY TRAINING DATA
Train Classifier: Specify model parameters, fit the K-NN model on training data.
Predict on Test Set: Assess model performance by predicting labels for the test set
Performance Metrics: Leverage metrics() to evaluate accuracy, precision, and recall (see the sketch after this list)
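A hedged sketch of the split-train-evaluate loop, reusing the assumed cancer data frame and the recipe/spec pattern from the earlier sketch:
library(tidymodels)
set.seed(1234)                                              # once, at the start of the analysis
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)
# recipe and spec as before, but built from the TRAINING data only
knn_recipe <- recipe(Class ~ Concavity + Perimeter, data = cancer_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer_train)
# predictions on the held-out test set, then accuracy and the confusion matrix
cancer_preds <- predict(knn_fit, cancer_test) |> bind_cols(cancer_test)
cancer_preds |> metrics(truth = Class, estimate = .pred_class)
cancer_preds |> conf_mat(truth = Class, estimate = .pred_class)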
Critical Evaluation of Performance
Context of application defines acceptable accuracy and trade-offs between precision and recall.
Always compare classifier performance against a simple majority classifier
Importance of K parameter in K-NN and its impact on model performance.
Cross-Validation: Repeat process to evaluate accuracy using different splits of data.
Parameter Grid: Specify values to optimize during tuning and establish best-performing parameters (see the sketch after this list).
Underfitting: High K leads to poor model complexity; oversimplifies and loses predictive capacity.
Overfitting: Low K considers too much noise; less reliable predictions on new data.
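A sketch of tuning K with cross-validation under the same assumptions (cancer_train and knn_recipe as above); the grid of candidate K values is arbitrary:
library(tidymodels)
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)      # 5 cross-validation splits of the training data
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>           # K left free for tuning
  set_engine("kknn") |>
  set_mode("classification")
k_grid <- tibble(neighbors = seq(1, 15, by = 2))                   # parameter grid of candidate K values
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_grid) |>
  collect_metrics()                                                # accuracy per K, averaged over the folds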
Effects of Irrelevant Predictors
Inclusion of irrelevant predictors harms model accuracy by introducing noise
Utilize systematic methods for variable selection by removing non-informative variables
Forward Selection Method
Iterative approach to build models with increasing sets of predictors.
Allows focusing on maintaining accuracy while minimizing predictor variables.
Function | Description |
set.seed() | Ensures consistent results from random processes in analyses |
sample() | To randomly pick a number |
initial_split() | To split a dataset into training and testing sets |
metrics() | Calculates performance metrics for a model's predictions, to assess how well the model is performing on test data |
glimpse() | To view data with a large number of columns |
bind_cols() | To add a column of predictions to the original data set |
tune_grid() | To fit the model for each value in a range of parameter values |