ACCT 331 LECTURE 4
Vectors and Matrices: Why They Matter
Linear algebra is foundational for organizing data into vectors and matrices, enabling processing of entire datasets with a single operation rather than iterating over data points individually.
Benefits include faster computation, simpler mathematical expressions, and the ability to replace many explicit loops and conditionals with single matrix operations.
These techniques are central to optimizing algorithms, especially in machine learning and data science.
Why is linear algebra critical?
Analogy: maps for data representation.
Vectors and matrices provide a universal, compact data representation so we don’t have to constantly reformat data for different programs.
They encode everything from text to images, graphs, and signals within a single, consistent framework.
They enable efficient computation at scale: computations become operations on matrices/vectors rather than per-item loops.
They are foundational to AI and data analysis; many models rely on linear algebra as building blocks.
They help shrink high-dimensional data into more tractable forms for computation, which is key in deep learning (e.g., forward/backward propagation).
Vectors and matrices as data structures
Vectors are an ordered list of numbers; each component represents a feature or characteristic (e.g., a data point’s attributes).
Matrices organize multiple vectors: rows and columns typically represent samples and features (see data representation below).
Two main purposes of vectors:
Represent data in numeric form (machines don’t understand words/images directly).
Enable fast mathematical calculations on entire data points at once.
Example of a vector: a data point with 3 features (x1, x2, x3).
In AI, vectors are pervasive (e.g., word embeddings with many dimensions, often around 100).
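To make the embedding idea concrete, here is a minimal sketch using made-up 4-dimensional vectors (real word embeddings typically have 100+ dimensions, and these numbers come from no trained model); cosine similarity is one common way to compare such vectors:

```python
import numpy as np

# Toy "embeddings" -- invented values for illustration only.
king  = np.array([0.9, 0.8, 0.1, 0.3])
queen = np.array([0.8, 0.9, 0.2, 0.3])
apple = np.array([0.1, 0.2, 0.9, 0.7])

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 = same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(king, queen))  # high: similar vectors
print(cosine_similarity(king, apple))  # lower: dissimilar vectors
```

The point is not the specific numbers but that once words are vectors, "similarity of meaning" becomes ordinary arithmetic.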
Data representation in matrices
A typical data matrix has samples as rows and features as columns (or vice versa depending on convention).
Example: a row could represent one image. A 28×28 image (784 pixels) is flattened into a vector of length 784, i.e., x \in \mathbb{R}^{784}.
In practice, a dataset with n samples and p features is represented as a matrix X \in \mathbb{R}^{n \times p}, where each row is a sample and each column is a feature.
Note on orientation: the lecture describes each column as corresponding to a feature, but also gives an example where an image occupies a row of 784 entries. The key idea is that matrices organize data for efficient computation; whether samples sit in rows or columns is a convention, and vectorization supports efficient row-wise or column-wise operations either way.
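The flattening step above can be sketched in Python with NumPy (using random pixel values as a stand-in for a real image):

```python
import numpy as np

# A fake 28x28 "image": random intensities stand in for real pixel data.
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Flatten to a vector in R^784.
x = image.reshape(-1)
print(x.shape)  # (784,)

# Stack n such vectors as rows to get a data matrix X in R^{n x 784}.
n = 5
X = np.stack([rng.random((28, 28)).reshape(-1) for _ in range(n)])
print(X.shape)  # (5, 784)
```

Once the data sits in X, a single matrix operation can transform all n samples at once instead of looping over them.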
Feature concepts and data points
A feature is an independent variable with numeric or symbolic properties associated with the object of study.
Features encode the characteristics used to describe data points (e.g., age, weight, height for a person).
In text, images, or graphs, vectors encode the relevant attributes for mathematical processing.
In language models, words are embedded into vectors that capture semantic properties; typical vectors may have 100+ entries.
A data point (customer, image, word, etc.) is represented as a vector; the collection of features is the vector.
Matrix operations and intuition
Matrices enable bulk operations on data: e.g., multiplying a matrix by another to transform many data points at once.
Example of a matrix operation: multiply two 3×3 matrices A and B. The element in row i, column j of the product AB is:
(AB)_{ij} = \sum_{k=1}^{3} A_{ik} B_{kj}
In code (e.g., Python), you typically manipulate data as arrays (e.g., NumPy arrays) rather than matrix literals; the numerical computations are accelerated by linear algebra libraries.
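A quick sketch in Python (assuming NumPy is available): compute the product with the optimized `@` operator, then verify one entry against the summation definition by hand:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
B = np.array([[9., 8., 7.],
              [6., 5., 4.],
              [3., 2., 1.]])

C = A @ B  # matrix product, computed by an optimized linear algebra routine

# Check one entry against the definition (AB)_ij = sum_k A_ik * B_kj.
i, j = 0, 1
manual = sum(A[i, k] * B[k, j] for k in range(3))
print(C[i, j], manual)  # both 24.0
```

The single `A @ B` call replaces three nested loops, which is exactly the "bulk operation" benefit described above.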
The data lifecycle: from algorithm to model
An algorithm (f) is trained on data to learn relationships and fit coefficients.
After training, the algorithm becomes a model that can predict outcomes on new data.
In the course, this leads into regression techniques and more advanced models.
Simple linear regression: setup and goals
Scenario: predict a continuous outcome from a single feature.
Model form (one independent variable):
y = w x + b
where y is the dependent variable, x is the independent variable, and w, b are coefficients learned from data.
The goal is to choose w and b to best fit the observed data.
Least squares fitting
The typical method to fit the linear model is least squares: minimize the sum of squared residuals (errors) between observed and predicted values.
Predicted value for observation i: \hat{y}_i = w x_i + b
Residual for observation i: e_i = y_i - \hat{y}_i
Sum of squared residuals (SSE):
\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
A smaller SSE means a better fit to the data.
The lecture notes emphasize: you do not necessarily need to memorize the formulas exactly, but you should understand that the goal is to minimize the discrepancy between observed and predicted values.
Model evaluation: loss and error
During training, we refer to the objective as a loss (or loss function) that we minimize.
After training, we evaluate model performance using error-based metrics (e.g., SSE, RMSE) and explanatory power metrics like R-squared.
Loss vs. error: in training, we speak of loss; in evaluation, we refer to error as the model’s deviation from true values.
Residuals, R-squared, and interpretation
Residuals: differences between observed and predicted values, e_i = y_i - \hat{y}_i.
R-squared (coefficient of determination) measures how much of the variance in the target is explained by the model:
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, where SS_{\text{res}} = \sum (y_i - \hat{y}_i)^2 and SS_{\text{tot}} = \sum (y_i - \bar{y})^2.
If, as in the example, R^2 = 0.72, then about 72% of the variability in y is explained by the model.
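The R-squared formula translates directly into code. A small sketch with invented observed values and hypothetical model predictions:

```python
import numpy as np

y     = np.array([3.1, 4.9, 7.2, 9.0, 10.8])    # observed values (made up)
y_hat = np.array([3.1, 5.05, 7.0, 8.95, 10.9])  # hypothetical predictions

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # close to 1: the fit explains almost all the variance
```

Note that a model predicting the mean of y for every point would give R^2 = 0, which is the baseline this metric compares against.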
Simple linear regression in practice
Example from lecture: predict housing prices using features such as number of bedrooms, square footage, year built, and distance to the nearest park.
A hypothetical line could yield a predicted price; the coefficients are found by minimizing SSE with the least squares method.
The graph of data and the regression line illustrates the fit and residuals.
What is linear regression good for? and limitations
Applications cited: finance (stock prices), predicting real estate values, risk assessment, healthcare, etc.
Linear regression serves as a baseline model and a building block for more complex AI methods.
In deep learning, neurons perform linear operations (a dot product plus bias) followed by nonlinear activation, illustrating why linear algebra is foundational.
Validation of a baseline model is common: compare several algorithms (3–5) and assess their errors/accuracy to choose a starting point.
Logistic regression: classification extension
When the task is classification (yes/no, spam/not spam, fraud/not fraud), linear regression is not suitable because its outputs are real-valued and not restricted to [0, 1].
Logistic regression starts with a linear model and then applies a nonlinear squashing function (sigmoid) to produce a probability.
Key formulation:
Linear score: z = w^\top x + b
Probability (class 1): P(y=1|x) = \sigma(z) where the sigmoid function is
\sigma(z) = \frac{1}{1 + e^{-z}}
The output is a probability, not a hard numeric value; a decision boundary can be chosen to classify (e.g., threshold 0.5).
Why not use linear regression for classification? Because the output can exceed [0,1] and there may be more than one class; logistic regression provides a probabilistic interpretation suitable for binary classification.
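The full pipeline — linear score, sigmoid, threshold — fits in a few lines. A sketch with hypothetical learned weights (the values of w and b are invented for illustration, not fit to any data):

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and bias for a two-feature problem.
w = np.array([1.5, -2.0])
b = 0.5

x = np.array([2.0, 0.5])  # one data point
z = w @ x + b             # linear score (can be any real number)
p = sigmoid(z)            # probability of class 1, guaranteed in (0, 1)
label = int(p >= 0.5)     # decision threshold at 0.5
print(z, p, label)
```

Notice that z itself is unbounded, which is exactly why the raw linear output cannot be read as a probability; the sigmoid repairs that.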
Use cases for logistic regression
Spam detection, fraud detection, credit approval, machine part failure prediction, healthcare diagnoses, etc.
In these problems, the model estimates the probability of a given class rather than a continuous outcome.
Connections to broader topics in AI and ML
Vectors and embeddings underpin many AI components, including language models, where words are represented as vectors in a high-dimensional space.
Matrix operations scale well with data size, enabling efficient training of large models.
Forward propagation and backpropagation (to be covered later) are built on linear algebra foundations (linear transformations and derivatives).
Quick recap: key terms and formulas
Model form (simple linear regression): y = w x + b
Predicted value: \hat{y}_i = w x_i + b
Residual: e_i = y_i - \hat{y}_i
Sum of squared residuals (SSE): \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Coefficient of determination (R-squared): R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
Sigmoid function: \sigma(z) = \frac{1}{1 + e^{-z}}
Logistic regression probability: P(y=1|x) = \sigma(w^\top x + b)
Image vectorization example: a 28×28 image is represented as a vector of length 784: x \in \mathbb{R}^{784}
Matrix multiplication (general): (AB)_{ij} = \sum_{k} A_{ik} B_{kj}
Next topics mentioned in the course
Forward propagation and backpropagation in deep learning
More regression types (beyond simple and multiple linear regression)
Practical application with housing price problems and model evaluation on real data
Assignment guidelines and practice problems to reinforce these concepts
Connections to real-world data science practice
Start with a simple baseline model (linear regression) to set a reference point before trying more complex models.
Use matrix/vector representations to scale computations for large datasets (e.g., image datasets with thousands of pixels per image).
Use regression for predicting continuous outcomes; switch to logistic regression for probabilistic classification tasks.
Always benchmark multiple models and report metrics like SSE and R-squared to quantify fit and explainability.
Homework and preparation notes
Practice computing SSE and R-squared by hand for small datasets to reinforce intuition.
Work through a housing price example, identifying features, building the simple linear regression model, and interpreting the coefficients.
Implement a logistic regression example with a binary target and a sigmoid output, and interpret the resulting probabilities.
Vectors and Matrices: The Foundation of Data Science and AI
Linear algebra is crucial for efficiently organizing and processing data. Vectors and matrices transform entire datasets in single operations, leading to faster computations and simplified mathematical expressions. This compact data representation (analogy: maps for data) means we don’t need to reformat data for different programs, as they universally encode diverse data types like text, images, and signals. They serve as foundational building blocks for many AI and data analysis models, effectively reducing high-dimensional data for computation, especially in deep learning.
Data Structures: Vectors and Matrices
Vectors: Ordered lists of numbers where each component represents a feature. They are fundamental for:
Representing data numerically, as machines don't directly process words or images.
Enabling rapid mathematical calculations across entire data points.
Matrices: Organize multiple vectors, typically with samples as rows and features as columns (or vice versa). A dataset with n samples and p features is often a matrix X \in \mathbb{R}^{n \times p}.
Example: A 28x28 image (784 pixels) can be vectorized into a row of length 784, i.e., x \in \mathbb{R}^{784}. This allows for efficient row-wise or column-wise operations.
Features and Data Points
A feature is an independent variable (numeric or symbolic) describing an aspect of the object of study (e.g., age, height).
A data point (customer, image, word) is represented as a vector, where the collection of its relevant attributes constitutes the vector. Language models embed words into vectors to capture semantic properties.
Matrix Operations Intuition
Matrices facilitate bulk operations. For instance, multiplying matrices allows for simultaneous transformations of many data points.
General Matrix Multiplication: the element in row i, column j of the product of matrices A and B is (AB)_{ij} = \sum_{k} A_{ik} B_{kj}. This operation is optimized by linear algebra libraries in practice.
The Data Lifecycle: From Algorithm to Model
An algorithm (f) is trained on data to learn relationships and fit coefficients. Once trained, it becomes a model to predict outcomes on new data. This concept underpins regression and other advanced AI models.
Simple Linear Regression: Predicting Continuous Outcomes
Goal: To predict a continuous dependent variable (y) from an independent variable (x).
Model Form: y = w x + b , where w is the slope (weight) and b is the y-intercept (bias). The objective is to learn optimal values for w and b from the data.
Mastery Detail: Least Squares Fitting
The primary method to fit a linear model is least squares, which minimizes the Sum of Squared Residuals (SSE).
SSE: the sum of squared residuals across all n observations, \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where each residual is the gap between the observed and predicted value.
A smaller SSE indicates a better fit; the coefficients w and b are chosen to minimize this value. Practice computing SSE for small datasets to solidify understanding.
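Besides the closed-form least-squares solution, the SSE can also be minimized iteratively with gradient descent — the same idea that scales up to training deep networks. A minimal sketch on made-up data lying exactly on y = 2x + 1 (learning rate and iteration count chosen by hand for this toy problem):

```python
import numpy as np

# Toy data exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

w, b = 0.0, 0.0
lr = 0.01  # learning rate, tuned by hand for this example

for _ in range(5000):
    y_hat = w * x + b
    # Gradients of SSE = sum((y - y_hat)^2) with respect to w and b.
    grad_w = -2 * np.sum((y - y_hat) * x)
    grad_b = -2 * np.sum(y - y_hat)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # converges near w = 2, b = 1
```

Each step nudges w and b downhill on the loss surface; with enough steps the iterates land on the same coefficients the closed form would give.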
Mastery Detail: Model Evaluation (Loss and Error, R-squared)
During training, the SSE is often referred to as the loss function, which is minimized. After training, the model's performance is evaluated using error metrics.
R-squared (R^2) or Coefficient of Determination: Measures the proportion of variance in the dependent variable (y) that is predictable from the independent variable (x).
An R^2 of 0.72 means 72% of the variability in y is explained by the model, indicating a good fit. Understanding both SSE (absolute error) and R-squared (relative explanatory power) is crucial.
Applications and Limitations
Applications: Finance (stock prices), real estate, risk assessment, healthcare.
Role: Serves as a fundamental baseline model and a building block for more complex AI methods. Validating a simple baseline is essential before moving to complex models.
Logistic Regression: Classifying Probabilistic Outcomes
Purpose: Used for classification tasks (e.g., binary outcomes like spam/not spam) where linear regression is unsuitable because its output is not constrained to probabilities ([0, 1]).
Key Formulation:
Linear Score (z): Similar to linear regression, a weighted sum of inputs plus a bias: z = w^\top x + b
Sigmoid Function (\sigma): This non-linear "squashing" function transforms the linear score into a probability between 0 and 1.
\sigma(z) = \frac{1}{1 + e^{-z}}
Probability (Class 1): The output is the probability of the positive class given the input x: P(y=1|x) = \sigma(z)
A decision boundary (e.g., 0.5 threshold) is then applied to classify the outcome.
Why not linear regression for classification? Linear regression outputs can be any real number, which is not interpretable as a probability, and it struggles with more than two classes directly. Logistic regression naturally produces a probabilistic output for classification tasks.
Use Cases for Logistic Regression
Spam detection, fraud detection, credit approval, medical diagnoses, predicting equipment failure. In all these cases, the model estimates a probability of an event occurring.
Connections to Broader AI and ML
Vectors and Embeddings: Crucial in language models, representing words in high-dimensional semantic spaces.
Matrix Operations: Enable efficient scaling of computations for large datasets, fundamental for training in deep learning.
Forward Propagation and Backpropagation: These core deep learning algorithms are built entirely on linear algebra foundations, involving linear transformations and derivatives.
Final Thoughts for Mastery
Always start with a simple baseline like linear regression.
Master the interpretation of metrics like SSE and R-squared for continuous outcomes.
Understand why logistic regression is necessary for classification and how the sigmoid function provides probabilistic outputs.
Recognize that linear algebra is not just math, but the computational language of modern AI.