Chapter 1-10: Machine Learning — Key Vocabulary for Lecture Review
SEIS 763 Section 2 — Machine Learning: Introduction and Course Overview
Welcome and logistics
- Instructor: Shrey, pronouns he/him/his; based in Minneapolis; in-person lectures preferred when possible; Zoom available; lectures streamed and recorded
- Recording and access
- Lectures are recorded and available on Canvas (in Zoom Pro section)
- Slides (PDF) uploaded ~5 minutes before the lecture; students should keep slides up-to-date themselves as corrections may be made during the lecture
- Communication and participation
- Lectures are intended to be a dialogue; raise questions in class or via chat (not always monitored live); virtual hand-raise recommended
- Course schedule and logistics
- Course: SEIS 763, Section 2, Machine Learning
- Schedule: Tuesdays, 17:30–20:30, Owens Science Hall (in-person) or the same Zoom link for online attendance
- Canvas is the hub for syllabus, modules, homework, exams, and the project
- Assessment and evaluation overview
- Four key components, each 25%: homework, midterm, final, group term project
- Midterm: coding-based; covers the first six lectures; limited internet access; no generative AI tools allowed
- Final: conceptual with some math; closed-notes
- Project: group project; milestones include a proposal, a presentation, and a report
- Honorlock and practice assessments
- Exams are monitored by Honorlock; practice/mock exams will be provided (e.g., to prepare for the coding components)
- If you encounter issues with Honorlock, contact Honorlock support; instructors can't fix technical Honorlock issues during the exam
- Working with Python and environments
- The course emphasizes foundational concepts and Python-based ML workflows using off-the-shelf libraries (e.g., scikit-learn)
- This is not a Python tutorial: prerequisite Python knowledge assumed; labs show Python usage, but debugging is the student’s responsibility
- Generative AI policy
- Generative AI can aid learning but is not allowed on exams; limited use on assignments may be permitted, but you must understand everything you submit; exam integrity and your own understanding are prioritized
- Course structure and intent
- Aim: build foundational ML concepts and the math that governs ML; learn to implement using Python libraries
- Emphasis on standard training pipelines and intuition for when to apply different algorithms
- The course avoids a deep-dive into deep learning, reinforcement learning, CV, NLP, or generative AI; those topics appear in later courses (e.g., AI course)
Instructor background and personal context
- Personal and professional background
- Mechanical engineering undergraduate and master’s; PhD in medical robotics and AI from UIUC; focus areas include computer vision, ML, brain/body signals, VR/AR
- Current work: leads computer vision at Cargill; video and image analytics, production planning
- Prior experience: internships/roles at Target; work at HCESC (Health Care Engineering Systems Center) on therapy-assistive robots and brain/muscle activity monitoring
- Interests outside work
- Photography and astrophotography; images include Milky Way core, Orion Nebula, Andromeda Galaxy
- Interest in Formula One, reading (ranging from political/history to sci-fi with cat aliens and lizard aliens), and Harry Potter
- Educational and research focus
- Research areas include computer vision, robotics, reinforcement learning, biomechanical signals, VR/AR; labs contributed to pioneering work in VR in the 1990s
- Teaching philosophy
- Emphasizes a dialogue in lectures; not a monologue; willingness to clarify and follow up after class if needed
Course goals, scope, and prerequisites
- Goals
- Build foundational ML concepts and the math that governs ML
- Learn to apply concepts using Python and standard libraries (e.g., scikit-learn)
- Understand standard training pipelines and when to adapt them
- Scope and limitations
- Not a Python tutorial; labs provide Python practice and boilerplate code, but debugging is student responsibility
- Not a deep dive into advanced topics (deep learning, reinforcement learning, CV, NLP, generative AI)
- No model deployment content in this course (ML engineering content reserved for later courses)
- Prerequisites
- Python proficiency is assumed, based on prerequisite courses the student should have completed
- If Python is rusty, links and resources are provided in module sections for self-study
- Resource organization
- Course content organized in Canvas modules; module 1 is mandatory to access today’s module
Course philosophy: theory vs practice and library choices
- Balancing theory and practice
- ML is both mathematically grounded (linear algebra, calculus) and practice-oriented (coding, pipelines)
- The course aims for a middle ground: open-source Python libraries offer high-level implementations with tunable parameters
- Tooling choices
- Preferred library: scikit-learn (sklearn); widely used and well-documented
- Python environments and notebooks (Jupyter) are used for labs and demonstrations
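To make the "high-level implementations with tunable parameters" point concrete, here is a minimal scikit-learn sketch; the synthetic data and the Ridge estimator are illustrative choices, not course-mandated:

```python
# Minimal sketch of scikit-learn's high-level estimator API:
# hyperparameters are constructor arguments, learning is fit/predict.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)            # fixed seed for reproducibility
X = rng.normal(size=(100, 3))             # 100 samples, 3 features (synthetic)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3  # synthetic linear target

model = Ridge(alpha=1.0)                  # alpha: a tunable hyperparameter
model.fit(X, y)                           # learn weights and bias from data
print(model.coef_, model.intercept_)      # learned w and b
print(model.predict(X[:5]))               # predictions for five samples
```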
ML basics: AI, ML, and DL; historical context
- Clarifying terms
- AI (Artificial Intelligence): broad field encompassing any system exhibiting intelligent behavior (rule-based systems, chess bots, etc.)
- ML (Machine Learning): subset of AI focusing on pattern recognition and learning from data
- DL (Deep Learning): subset of ML using artificial neural networks
- Generative AI is a popular form of AI today but is only one application area within AI
- Brief history of AI and ML milestones
- 1956: Dartmouth Conference establishes AI as a field
- 1957: Perceptron (early neural network) introduced
- 1986: Rumelhart, Hinton, and Williams popularize backpropagation, reviving neural networks (which gain broader traction later with GPUs)
- 1997: IBM Deep Blue defeats Kasparov in chess
- 2011: Watson beats Jeopardy champions
- 2014: Facebook DeepFace demonstrates strong facial recognition capabilities
- 2012: AlexNet demonstrates the efficacy of deep CNNs trained on GPUs, kicking off the modern deep learning era
- 2015–2016: DeepMind's AlphaGo defeats professional Go players using deep reinforcement learning
- 2020: OpenAI GPT-3 demonstrates large-scale language modeling and generation
- What ML is, conceptually
- Given features x, predict target y
- The key problem: learn weights w that assign relative importance to features so that wᵀx approximates y
- Analogy: 20 questions game—features help narrow down the possible labels; the model learns which features matter most
- Core components of a machine learning system (in this course context)
- Features: clues/descriptors used to infer the label
- Label: ground-truth answer or category
- Model: the learnable function mapping features to predictions
- Loss function: measures how far predictions are from ground-truth
- Optimizer: updates model parameters to minimize the loss
- Simple example: 8 as a digit with three features (a runnable sketch appears at the end of this section)
- Features (descriptors): x1 = distance between start/end, x2 = curved edges, x3 = intersects itself
- Label y = 8
- Feature values might be x = [start_end_distance, curved_edges, intersects_itself]ᵀ
- Weights w (learned): wᵀx approximates y; a bias (b) can shift the prediction
- Ground truth and learning dynamics
- Ground truth provides target y for each x during training
- Learning is trial-and-error: adjust w to reduce prediction error
- Data quality and label reliability
- “Garbage in, garbage out”: ground-truth quality controls model performance
- Models cannot easily correct incorrect human labels; some techniques exist to fix labeling errors but supervision quality matters
- Model evaluation and progress signal
- Progress is tracked via a metric (e.g., accuracy or error rate); improvement indicates learning direction
- Practical implications and limitations
- 100% accuracy is rare; often indicates a toy problem or overfitting; in real data, there is noise, ambiguity, or label error
- Overtraining/overfitting occurs when a model fits the training data too well and fails to generalize
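To ground the digit-8 example above, here is a minimal sketch of a single prediction ŷ = wᵀx + b; the lecture names the features but not numeric values, so every number below is invented for illustration:

```python
# Hypothetical feature vector for a handwritten "8" (values invented;
# the lecture only names the features, not their numbers).
import numpy as np

x = np.array([0.1,   # x1: distance between stroke start and end
              4.0,   # x2: amount of curved edges
              1.0])  # x3: whether the stroke intersects itself
w = np.array([0.5, 1.2, 2.1])  # weights: relative feature importance
b = 0.3                        # bias shifts the prediction

y_hat = w @ x + b              # w^T x + b
print(f"prediction: {y_hat:.2f}  (target label y = 8)")
```

The gap between the printed prediction and the target label 8 is exactly the error that training reduces by adjusting w and b.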
Mathematical formulation of a linear model (from lecture examples)
- Regression setting (continuous y)
- One-feature example: y = m x + c
- We replace slope m with weight w and intercept c with bias b in ML notation: y = w x + b
- Multi-feature generalization
- For a vector x ∈ R^d, and weight vector w ∈ R^d, a single prediction is:
- ŷ = wᵀx + b
- For a dataset with n samples, the batch form is y = Xw + b·1, where X ∈ R^{n×d}, w ∈ R^{d×1}, b ∈ R, and 1 ∈ R^{n×1}
- In many ML texts, the bias term b is added to every prediction; equivalently, you can augment x with a constant feature 1 so that the bias becomes part of the weight vector: y = w̃ᵀx̃, with x̃ = [x; 1] and w̃ = [w; b] (see the NumPy sketch at the end of this section)
- Interpretation of weights and bias
- Weight w_i reflects the importance (discriminative power) of feature i
- A larger magnitude of w_i means feature i has greater influence on the predicted y
- The bias term b shifts the regression plane up or down to better fit data
- Takeaway from the linear model example
- If the best fit line has slope around 1 and intercept ~0, then y ≈ x; if intercept is moved upward (positive bias), predictions increase by that amount across all x values
- Conceptual link to the feature-importance metaphor
- In higher dimensions, a weight vector w assigns a slope along each feature axis
- The learned w captures how changes in each feature affect the output, guiding predictions on new data
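A minimal NumPy sketch of the batch form y = Xw + b·1 and the bias-augmentation equivalence described above; all numbers are synthetic:

```python
# Batch linear model and the bias-augmentation trick (synthetic data).
import numpy as np

rng = np.random.default_rng(42)
n, d = 5, 3
X = rng.normal(size=(n, d))       # n samples, d features
w = rng.normal(size=d)            # weight vector
b = 0.7                           # scalar bias

y_batch = X @ w + b               # batch form: b broadcasts to all rows

# Augment each sample with a constant 1 so the bias joins the weights.
X_aug = np.hstack([X, np.ones((n, 1))])   # x̃ = [x; 1]
w_aug = np.append(w, b)                   # w̃ = [w; b]
y_aug = X_aug @ w_aug

print(np.allclose(y_batch, y_aug))  # True: the two forms agree
```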
Take-home intuition on features, labels, and data representations
- Features (x): vector of clues used to infer the label
- Label (y): the target value or category the model should predict
- Ground truth: actual observed label used during training
- Importance of multiple samples
- Real-world tasks require many samples with diverse feature representations to generalize well
- Why you rarely get exact 100% accuracy
- Real data are noisy; humans may label imperfectly; handwriting or sensor data can be ambiguous; models must generalize beyond observed samples
Practical ML workflow concepts mentioned in the lecture
- Training vs evaluation: models learn by minimizing a loss on labeled data; performance is assessed on unseen data to evaluate generalization
- Gradient descent and learning rate
- Training typically involves updating weights to minimize the loss; the learning rate governs the step size of each weight update (see the gradient-descent sketch after this list)
- Feature engineering and data richness
- More informative features (and a thoughtful feature set) help disambiguate similar inputs
- Data quality and labeling
- Clean, accurate labels improve model performance; beware of biased or incorrect ground truth
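A minimal gradient-descent sketch for a linear model under mean squared error; the data, learning rate, and epoch count are illustrative choices, not the course's prescribed implementation:

```python
# Gradient descent on a linear model with squared loss (toy example).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.5
y = X @ true_w + true_b + 0.1 * rng.normal(size=200)  # noisy targets

w, b = np.zeros(2), 0.0
lr = 0.1                                   # learning rate = step size
for epoch in range(100):
    y_hat = X @ w + b
    err = y_hat - y                        # prediction error
    grad_w = 2 * X.T @ err / len(y)        # dL/dw for mean squared error
    grad_b = 2 * err.mean()                # dL/db
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(w, b)  # should approach true_w and true_b
```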
Python, NumPy, and Pandas: quick primer from the lab portion
- Python basics (recap)
- Variables, types (int, float, string, bool)
- Basic operations, if statements, for/while loops
- Functions with return values; Python f-strings for readable output
- Data structures
- Lists and dictionaries; indexing, negative indexing; list/dictionary operations
- NumPy fundamentals
- Arrays (1D and 2D), zeros, ones, identity matrices
- Array shape, indexing, slicing, and basic elementwise operations
- Matrix multiplication rules: shapes must align
- Transpose and reshape operations; broadcasting introduced conceptually
- Random numbers and random seeds for reproducibility: use a fixed seed to reproduce experiments
- Pandas essentials
- DataFrame as the primary data structure for tabular data
- Creating a DataFrame from a dictionary; loading data from CSV/Excel with read_csv/read_excel
- df.head(), df.tail(), df.info(), df.describe() for quick data understanding
- Display vs print for richer interactivity in notebooks
- Selecting, filtering, and indexing: df['col'], df[['col1','col2']], df.loc/df.iloc
- Setting and changing index (e.g., by date) to enable label-based queries
- Adding new columns via vectorized expressions (e.g., tip percentage = tip/bill * 100) and using apply for row-wise operations
- Data transformations: groupby (e.g., by day), aggregations (mean, sum), and multi-key groupings
- Handling missing values and data types (float, int, object)
- Visualization with Matplotlib
- Basic plots: scatter plots, line plots, bar charts, histograms
- Plot customization: color, labels, titles, legends, grid, markers, and regression line overlays
- Pandas plotting convenience: df.plot with column specifications
- Subplots: combining multiple plots in a single figure with plt.subplots or plt.subplot
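A short end-to-end sketch tying the primer together, using invented numbers modeled on the lecture's tips example (build a DataFrame from a dict, add a vectorized column, group-aggregate, plot):

```python
# Mini Pandas/Matplotlib workflow with toy tips data (numbers invented).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "day":  ["Thu", "Thu", "Fri", "Fri", "Sat", "Sat"],
    "bill": [20.0, 35.5, 18.0, 42.0, 50.0, 27.5],
    "tip":  [3.0, 5.0, 2.5, 8.0, 9.5, 4.0],
})
df["tip_pct"] = df["tip"] / df["bill"] * 100   # vectorized new column

print(df.head())           # quick inspection
print(df.describe())       # summary statistics

by_day = df.groupby("day")["tip_pct"].mean()   # aggregate per group
by_day.plot(kind="bar", title="Mean tip % by day")
plt.xlabel("day")
plt.ylabel("tip %")
plt.show()
```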
Assignment 1: data analysis with 20 Delta flights (Minneapolis → Amsterdam)
- Data setup and content
- Folder and files: data/ contains 20 CSV files, each with a timestamp (epoch), position (GPS coordinates), altitude, speed, and heading (direction)
- Tasks and expectations
- Q1: Load all 20 CSV files into a collection (e.g., a list or dict of DataFrames) using a loop; print each loaded file name; display the first/last rows and DataFrame.info() for at least one file (see the sketch after this section)
- Q2: For all files, create new human-readable date and time columns by converting the UTC epoch timestamps
- Q3: For each flight, determine departure and arrival times
- Departure time: first row where speed > 0 (i.e., takes off)
- Arrival time: first time after takeoff where speed returns to 0 (the 50% rule discussed in lecture keeps takeoff and landing well-defined)
- Output format using f-strings like: "Flight {name}: Departure = {date} {time}, Arrival = {date} {time}, Duration = HH:MM"
- Q4: Plot total flight time per flight as a bar chart (x-axis: departure date, y-axis: duration in hours:minutes)
- Q5: Plot the average flight time as a horizontal bar/line chart
- Q6: Split the flight time graph into three categories (ground before takeoff, in the air, ground after landing) based on takeoff/landing logic
- Q7: Create a data analytics section: identify any delays and their causes; define a delay criterion (e.g., delays > 10% of mean duration) and document reasons (in-air, ground, or post-landing delays)
- Q8–Q9: Optional bonus: infer which delay category was responsible using patterns in the data; hints are provided, implement if you can
- Q10: Ungraded (optional, extra credit)
- Output and submission guidelines
- Submit a notebook (.ipynb), an HTML export, and the data files in a zipped folder named with the student's last name and first name (so the name isn't lost on unzip)
- Notebook structure: first cell as Markdown with student name; each question followed by code cells; alternate between Markdown and Code cells
- The grader will review notebooks and HTML exports; there is no automated per-question feedback; common mistakes are addressed via announcements
- Presentation and evaluation details
- The assignment emphasizes Python fundamentals (Pandas and Matplotlib) over advanced ML topics
- Datasets and notebooks provided to illustrate data wrangling, feature engineering, and visualization workflows
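A hedged sketch of Q1–Q3, assuming the CSVs live in data/ and have columns named "timestamp" (UTC epoch seconds) and "speed"; the real column names may differ, so treat this as a starting point, not a solution:

```python
# Sketch for Q1-Q3. Column names "timestamp" and "speed" are assumptions;
# adapt them to the actual files.
import glob
import pandas as pd

flights = {}
for path in sorted(glob.glob("data/*.csv")):
    print(f"loading {path}")                  # Q1: report each loaded file
    df = pd.read_csv(path)
    # Q2: human-readable date/time from the UTC epoch timestamp
    df["datetime"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    flights[path] = df

# Q3: departure = first row with speed > 0; arrival = first row after
# takeoff where speed returns to 0 (simplified version of the lecture rule;
# assumes speed does eventually return to 0 in each file)
for name, df in flights.items():
    dep_idx = (df["speed"] > 0).idxmax()           # first moving row
    after = df.loc[dep_idx:]
    arr_idx = (after["speed"] <= 0).idxmax()       # first stop after takeoff
    dep, arr = df.loc[dep_idx, "datetime"], df.loc[arr_idx, "datetime"]
    print(f"Flight {name}: Departure = {dep}, Arrival = {arr}, "
          f"Duration = {arr - dep}")
```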
Take-home exercise and examples discussed in class
- Handwriting recognition intuition (digits 0–9): features vs labels
- Features: x1 = distance between start and end points; x2 = curvature; x3 = whether the stroke intersects itself; other potential features could include height, width, bounding box metrics
- Label: the digit (0–9), e.g., 8 in the example
- Idea: each digit can be described by a vector of feature values; learning finds a weight vector w to map features to the correct label
- Ambiguities and handwriting variability
- Digits can be drawn in multiple styles (e.g., a 4 drawn differently); more features help disambiguate digits and improve generalization
- This is why large and diverse datasets are essential for ML models to generalize
- Feature engineering and data dimensionality
- With more features, you can better separate classes; but more features require more data to learn robust weights
- Dimensionality and shapes intuition (for ML notation)
- For a single sample: x ∈ R^d, w ∈ R^d, y ∈ R
- For a dataset with n samples: X ∈ R^{n×d}, w ∈ R^{d×1}, b ∈ R, and ŷ ∈ R^n
- Batch prediction: ŷ = Xw + b·1; each ŷᵢ = wᵀxᵢ + b
- Takeaway about learning dynamics and data representation
- The model learns w and b such that wᵀx approximates y across samples
- If you add bias, you can account for systematic offsets in the data
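A tiny shape check mirroring the dimension bullets above (toy sizes; zeros are used only to make the shapes visible):

```python
# Dimension reasoning made concrete with toy sizes n=4, d=3.
import numpy as np

n, d = 4, 3
x = np.zeros(d)          # single sample: x ∈ R^d
w = np.zeros(d)          # weight vector: w ∈ R^d
X = np.zeros((n, d))     # dataset: X ∈ R^{n×d}
b = 0.0                  # scalar bias

print(w @ x + b)             # scalar: ŷ = wᵀx + b
print((X @ w + b).shape)     # (n,): ŷ = Xw + b, one prediction per sample
```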
Quick notes on study and practice strategy
- Start early on the first assignment; it’s designed to be challenging and to test Pandas/Matplotlib proficiency
- Use the provided tutorials and module resources if you are rusty with Python, NumPy, and Pandas
- Practice with Jupyter notebooks to build familiarity with the workflow: load data, inspect, transform, and visualize
- Remember the distinction between training error and generalization error; strive for representative data splits when you practice on your own
Quick reference: common commands and concepts mentioned in the lecture
- Python basics
- Printing, variables, types, if/for/while, functions, string formatting (f-strings), lists, dictionaries
- NumPy basics
- Arrays: vectors and matrices; zeros, ones, identity; shape; indexing; slicing; broadcasting; elementwise operations; matrix multiplication
- Random numbers and seeds for reproducibility
- Basic statistics (mean, std, min, max) and simple transformations
- Pandas basics
- DataFrame, read_csv, read_excel, df.head(), df.info(), df.describe()
- Display vs print; column selection: df['col'], df[['col1','col2']]
- Indexing with loc/iloc, setting/changing the index, grouping with groupby
- Creating new columns with vectorized operations; apply for row-wise operations
- Basic plotting via df.plot as a pandas-friendly wrapper around Matplotlib
- Matplotlib basics
- plt.scatter, plt.plot, plt.bar, plt.hist; labeling, legends, titles, grid, axis labels
- Subplots: plt.subplots or plt.subplot for multiple plots in one figure
- Saving and showing figures: plt.savefig() and plt.show()
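A small subplots sketch combining the commands above (toy data, illustrative layout):

```python
# Two plots in one figure via plt.subplots (toy data).
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x), label="sin")                           # line plot
ax1.set_title("line"); ax1.legend(); ax1.grid(True)
ax2.hist(np.random.default_rng(0).normal(size=500), bins=20)  # histogram
ax2.set_title("hist")
plt.tight_layout()
plt.savefig("figure.png")   # save before (or instead of) showing
plt.show()
```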
Important reminders for students
- The midterm and final will test understanding of both conceptual and mathematical aspects; be comfortable with linear models, the meaning of weights and bias, and the dimension reasoning discussed in lecture
- Practice with hands-on lab materials and mock exams to become comfortable with the format and Honorlock procedures
- For questions related to assignment and course logistics, use Canvas discussions or contact the instructor; announcements will be posted for general guidance
- Keep in mind the balance between theory and practice; aim to understand the intuition behind ML algorithms and how to implement them using Python libraries
Take-home exercise quick recap for exam prep
- Understand how to model an ML problem in the linear form y = wᵀx + b, where x is a feature vector and w contains weights that reflect feature importance
- Be comfortable translating a real dataset into X (features) and y (labels), and recognizing the role of the bias term in adjusting predictions
- Practice with simple two-feature examples (e.g., height, weight predicting gender) to build intuition about decision boundaries and weight interpretation
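A minimal sketch of the two-feature thought experiment using logistic regression on entirely fabricated height/weight clusters; inspecting coef_ and intercept_ shows how w and b define the decision boundary:

```python
# Two-feature classification intuition (all numbers fabricated;
# the lecture uses this only as a thought experiment).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100
# two synthetic clusters in (height cm, weight kg) space
class0 = rng.normal([165, 60], [6, 7], size=(n, 2))
class1 = rng.normal([178, 80], [6, 7], size=(n, 2))
X = np.vstack([class0, class1])
y = np.array([0] * n + [1] * n)

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b define the decision boundary
print(clf.predict([[170, 68]]))    # predicted class for a new sample
```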
Summary takeaway
- This course lays the foundation for ML: from understanding AI vs ML vs DL, to framing problems in terms of features and labels, to implementing simple linear models with Python tools, and to building practical data science workflows using NumPy, Pandas, and Matplotlib
- Expect to iteratively learn, test, and refine with hands-on labs and assignments, while keeping an eye on data quality and the generalization capabilities of your models