Detailed Notes: Machine Learning Workflow for a Course Recommender System
Problem framing and project overview
- Objective: Build a recommender system that suggests university courses.
- Recommendations are based on two criteria: (1) what students are good at (their caliber/strengths) and (2) current job market needs (skills in demand).
- The project is presented as a working scenario to illustrate a typical ML workflow, not as a coding exercise.
- The broader message: No matter the project, the machine learning workflow stays the same; some steps may be skipped depending on data quality and project specifics, but the core cycle remains.
- The project context and goal framing: imagine you are a data scientist given this brief; articulate the objectives and outcomes, linking them to the university's mission and to market demand.
- Core research question framing:
- Primary objectives and outcomes: why build the system? e.g., to help students apply knowledge to the market, improve employability, and support academic success.
- How the system supports the university’s mission: higher education quality, employability, and career readiness.
- How to measure success: what behaviors indicate a successful system? e.g., a generated list of courses that students adopt and find valuable.
- Potential challenges and limitations: data dependency is critical; data may be missing or insufficient; privacy and security concerns; licensing costs; storage needs.
- Stakeholders: end users (students), with faculty and advisers verifying the recommendations.
- Inputs and outputs: define what inputs look like (e.g., subject name, program name, list of interests) and the desired outputs (e.g., a list of recommended courses).
- The crucial warning about skipping steps:
- Skipping the problem-definition stage can lead to vague goals and misaligned outputs (example given: a textual description leading to outputs like course codes instead of course names).
- This stage sets the foundation for the entire workflow; without it, downstream steps become brittle.
- Practical takeaway: the problem-definition stage acts as a north star for all subsequent steps; do not skip it.
The universal machine learning workflow
- The workflow is the same across projects; there are stages that can be skipped based on data quality, but the cycle itself remains.
- The workflow consists of a cyclic sequence: define problem → collect data → preprocess data → split data → select/train model → evaluate → deploy → monitor/update → revisit earlier steps as needed.
- The workflow is iterative: you may loop back to earlier stages to improve data, adjust inputs, or change modeling approaches based on evaluation results.
Step 1: Define the problem
- What is the automation task? Define the problem in terms of inputs, outputs, and the expected behavior of the system.
- Outputs shape: What form should the system’s outputs take? (e.g., a ranked list of course recommendations with course names and reasons.)
- Important distinction: this is a proof-of-concept sketch on paper, not a fully coded system yet.
- Guiding questions (provided as a cheat sheet in the talk):
- What are the primary objectives and outcomes of the project?
- How does the system support the university’s mission (education quality and employability)?
- How do we measure success (what signals indicate a successful system)?
- What are potential challenges and limitations (data, privacy, governance, licensing, scalability)?
- Who are the stakeholders and end users (students, faculty, advisers)?
- What are the inputs and their formats, and what should be produced as output? (e.g., query with subject, program, interests → list of courses)
- Response dynamics during problem framing:
- Dialogue example shows two-way interaction with stakeholders and the instructor encouraging the student to articulate their own ideas before consulting a cheat sheet.
- Can we skip this stage? No. Skipping leads to vague requirements and misaligned outputs, as illustrated by a hypothetical scenario where outputs were not aligned with the input definitions.
- Example outputs to anchor thinking: a query with subject name, program name, and list of interests could yield a concrete list like:
- Applied machine learning, Cloud computing, …
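A rough sketch of the query-to-recommendation contract described above; the function body is a placeholder (not the eventual model), and the sample arguments are invented:

```python
from typing import List


def recommend_courses(subject: str, program: str, interests: List[str]) -> List[str]:
    """Placeholder contract: a student query in, a list of course names out."""
    # A trained recommender would rank real catalog entries here; this stub
    # just echoes the example output from the notes.
    return ["Applied machine learning", "Cloud computing"]


print(recommend_courses("Computer Science", "MS in Data Science", ["machine learning", "cloud"]))
```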
Step 2: Data collection
- Do we need to collect data?
- If public data exists, reuse it and start training quickly.
- If data is not available, identify feasible data sources.
- Potential data sources for this recommender system:
- University data: list of courses offered by the university or similar institutions; program/course catalog.
- Student data: transcripts, enrollment history, GPA, interests (collected with consent).
- Job market data: current demand signals, job postings, required skills from job platforms.
- Legal and financial considerations:
- Licensing costs for data or data-collection services.
- Data privacy and security considerations:
- Data privacy is critical (FERPA compliance for student data).
- Consideration of who can access data and how it is stored/transferred.
- The risk of re-identification or leakage of private information (e.g., student IDs, GPAs, interests).
- Data collection methods and tools:
- Web scraping as a tool to collect data from public sources (e.g., job postings, course catalog pages) when allowed; see the sketch at the end of this step.
- Surveys or consent-based collection for sensitive fields (e.g., student interests).
- Data transfer protocols and secure storage practices.
- Practical note: data collection decisions should align with privacy and security requirements and ensure that the data is legally usable for model development.
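A minimal sketch of the web-scraping route flagged above, assuming the site's terms of use and robots.txt allow it; the URL and CSS selector are hypothetical placeholders:

```python
import requests
from bs4 import BeautifulSoup

CATALOG_URL = "https://example.edu/catalog/courses"  # hypothetical course-catalog page


def fetch_course_titles(url: str = CATALOG_URL) -> list:
    """Download a catalog page and pull out course titles (selector is a placeholder)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # The tag/class below is illustrative; inspect the real page (and its robots.txt) first.
    return [node.get_text(strip=True) for node in soup.select("h3.course-title")]


if __name__ == "__main__":
    print(fetch_course_titles()[:10])
```

Consent-gated fields such as student interests would come from surveys or institutional systems rather than scraping.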
Step 3: Data processing and quality assessment
- Data quality challenges commonly encountered:
- Missing values in both student data and job market data.
- Inconsistent formats (e.g., dates in different formats across sources).
- Imbalanced data: e.g., many samples for CS courses vs. few samples for others, leading to bias in recommendations.
- Example data quality issues shown in the talk:
- Two sample tables with missing rows and inconsistent date formats.
- Potential bias: if there are far more samples for one domain (e.g., CS), the model may overfit to that domain and underrepresent others.
- The importance of data completeness and richness:
- If the dataset is not rich enough, consider engineering additional features to increase expressiveness (e.g., enriching a salary column with the median salary per role, or adding a feature such as days since a job posting was published).
- This step is optional if the data is already sufficiently complete, but it can significantly improve model learning.
- Possible actions if data is incomplete or biased:
- Collect more data, engineer additional features, or consider resampling to balance classes.
- Revisit data sources or definitions to reduce ambiguity.
- Tools and practices referenced:
- Data handling with pandas (e.g., handling missing values, normalizing date formats); see the sketch at the end of this step.
- Can this step be skipped? Generally not, because data quality directly impacts model performance and fairness. Skipping can introduce fatal errors or biased outcomes.
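A small pandas sketch of the cleaning steps referenced above (missing values, inconsistent date formats, imbalance check); the column names and toy records are invented, and format="mixed" assumes pandas 2.0+:

```python
import pandas as pd

# Toy job-market table with the kinds of issues described above:
# missing values and dates in inconsistent formats.
jobs = pd.DataFrame({
    "role": ["ML engineer", "Data analyst", "Cloud architect", None],
    "required_skill": ["machine learning", "SQL", "cloud computing", "statistics"],
    "posted": ["2024-01-15", "15/02/2024", "March 3, 2024", None],
    "salary": [120000, None, 135000, 90000],
})

# Normalize the date column to one representation; unparseable entries become NaT.
jobs["posted"] = pd.to_datetime(jobs["posted"], format="mixed", errors="coerce")

# Handle missing values: drop rows with no role, fill missing salary with the median.
jobs = jobs.dropna(subset=["role"])
jobs["salary"] = jobs["salary"].fillna(jobs["salary"].median())

# Check for imbalance across domains before training (counts per required skill).
print(jobs["required_skill"].value_counts())
```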
Step 4: Data splitting
- Purpose: to evaluate model performance on unseen data and prevent overfitting.
- Recommended data split scheme (as described in the talk):
- Training set: N_{train} = 0.7\,N
- Validation set: N_{val} = \alpha N, with \alpha \in [0.15, 0.25]
- Test set: N_{test} = N - N_{train} - N_{val} (the remaining portion)
- Rationale: training data is used to learn patterns, validation data is used to challenge learning and tune hyperparameters, and test data provides an unbiased assessment of performance on completely unseen examples.
- Note on variability: the exact split can vary (e.g., 70/15/15 or 70/20/10) depending on data size and project needs.
- Question addressed in the talk: can we skip data splitting? No, splitting is essential for a reliable evaluation and to avoid leaking information from training into testing.
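A quick sketch of a 70/15/15 split using scikit-learn's `train_test_split` applied twice; `X` and `y` are placeholder arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: N samples with a few numeric features and a label.
N = 1000
X = np.random.rand(N, 5)
y = np.random.randint(0, 2, size=N)

# First carve out 70% for training, leaving 30% for validation + test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)

# Split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```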
Step 5: Model selection and training
- The scenario focuses on generating lists of courses (recommendations) rather than a simple classification or regression task.
- Strategy for model selection:
- Start with open-source, accessible models suitable for recommenders or ranking tasks.
- Do not overcomplicate with heavyweight customization before establishing a baseline.
- Training focus: teach the model to produce a ranked list of courses given input features (e.g., subject, program, interests).
- Important caveat: this session does not dive into a deep taxonomy of classification vs. regression; the emphasis is on a practical pipeline for a recommendation task.
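One plausible open-source baseline in the spirit of this step, not a model prescribed by the talk: a content-based ranker that matches the student query against course descriptions with TF-IDF and cosine similarity. The mini-catalog is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-catalog: course name -> short description.
catalog = {
    "Applied machine learning": "supervised learning models python data pipelines",
    "Cloud computing": "cloud deployment scalability distributed systems",
    "Database systems": "sql relational databases transactions indexing",
    "Web development": "javascript frontend backend web applications",
}


def recommend(subject: str, program: str, interests: list, top_k: int = 3) -> list:
    """Rank catalog courses by TF-IDF cosine similarity to the student's query."""
    query = " ".join([subject, program] + interests)
    names = list(catalog)
    vectorizer = TfidfVectorizer()
    course_vecs = vectorizer.fit_transform(catalog.values())
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, course_vecs).ravel()
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


print(recommend("Computer Science", "MS Data Science", ["machine learning", "cloud"]))
```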
Step 6: Model evaluation
- Why evaluation is necessary:
- To ensure the model generalizes to unseen data and provides outputs that are valuable to users.
- A well-trained model that performs poorly in real use is not useful; evaluation helps decide if deployment is warranted.
- Evaluation methods illustrated:
- Use training vs. validation loss charts to monitor learning and overfitting; see the sketch at the end of this step.
- Determine whether performance on the validation/test data meets the required quality before deployment.
- If evaluation signals poor performance:
- Return to earlier steps (e.g., data collection/processing) to improve data quality.
- Consider adjusting features, collecting more data, or changing the modeling approach.
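A minimal sketch of the training-vs-validation loss chart mentioned above; the loss values are invented to show the typical overfitting pattern:

```python
import matplotlib.pyplot as plt

epochs = list(range(1, 11))
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26, 0.24, 0.22]  # keeps falling
val_loss = [0.92, 0.75, 0.62, 0.55, 0.52, 0.51, 0.52, 0.54, 0.57, 0.60]    # turns upward

plt.plot(epochs, train_loss, label="training loss")
plt.plot(epochs, val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Diverging curves after ~epoch 6 suggest overfitting")
plt.legend()
plt.show()
```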
Step 7: Deployment and web-based deployment considerations
- Deployment scenario described as a web-based application that could run in the cloud.
- Considerations for deployment:
- Where will the model be hosted and served (cloud platform choices, latency, availability)?
- How to ensure reliability and scalability as user load grows (e.g., 10k+ concurrent users).
- Backup plan (Plan B) in case of failures or outages.
- Infrastructure and tooling:
- Choice of cloud provider and deployment architecture.
- Monitoring and observability to detect performance degradation.
- Security and privacy implications continue to be important in deployment; ensure secure APIs and access controls.
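A minimal sketch of one way to expose the model as a web API, using FastAPI as an assumed (not prescribed) choice; the model call is stubbed, and authentication, rate limiting, and the Plan B fallback are omitted:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Course recommender API")


class Query(BaseModel):
    subject: str
    program: str
    interests: list[str]


@app.post("/recommend")
def recommend(query: Query) -> dict:
    # In production this would call the trained, versioned model behind access controls.
    courses = ["Applied machine learning", "Cloud computing"]  # stubbed output
    return {"courses": courses}

# Run locally with:  uvicorn main:app --reload   (assuming this file is main.py)
```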
Step 8: Monitoring, updating, and continuous improvement
- Why updates are necessary:
- Job market dynamics change over time; course offerings and student interests evolve; models can become stale if not updated.
- Update strategy considerations:
- Continuous monitoring of job market signals and student data; detect when changes warrant retraining or data collection updates.
- Version control for models and datasets to manage evolution over time.
- Version control and deployment/versioning strategies mentioned:
- Use naming conventions or GitHub to track different versions (e.g., versioned model artifacts); see the sketch at the end of this step.
- Plan for managing multiple versions of a trained model and data schemas.
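A small sketch of the naming-convention idea for versioned model artifacts; the directory layout, version string, and estimator are illustrative assumptions:

```python
from datetime import date
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

ARTIFACT_DIR = Path("artifacts")  # assumed local directory; could be object storage instead
ARTIFACT_DIR.mkdir(exist_ok=True)


def save_versioned(model, version: str) -> Path:
    """Persist a model under a name that encodes its version and training date."""
    path = ARTIFACT_DIR / f"course_recommender_v{version}_{date.today():%Y%m%d}.joblib"
    joblib.dump(model, path)
    return path


# Toy example: version whatever estimator was trained in step 5.
print(save_versioned(LogisticRegression(), version="1.2.0"))
```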
Step 9: The cycle and ongoing nature of ML projects
- The ML workflow is cyclical, not linear:
- After deployment, monitor performance, collect new data, and revisit earlier steps as needed.
- Each cycle helps ensure the system remains relevant and reliable.
- The practical takeaway: treat ML projects as iterative, living systems rather than one-off implementations.
Step 10: Related anecdotes, examples, and tangents
- Generative AI model example:
- A company wanted synonym generation for a generative AI task; because running massive models locally was impractical, they used a pre-trained model plus post-processing to get acceptable results.
- Image processing and perceptual systems (Netflix example):
- Pixels are numbers; images are processed by manipulating numerical representations.
- Netflix uses scene analysis to optimize streaming quality: when bandwidth is limited, it focuses quality on regions of interest and deprioritizes backgrounds to minimize buffering.
- Real-world parallels and learning metaphors:
- Universities use exams to assess performance; similarly, ML models need evaluation to ensure they have learned to perform on unseen data.
- Practical caution: many of these ideas require careful consideration of data privacy and ethical implications, especially when handling student data or job market data.
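To ground the "pixels are numbers" remark above, a tiny NumPy sketch with a made-up grayscale patch:

```python
import numpy as np

# A 3x4 grayscale "image" is just an array of intensities (0 = black, 255 = white).
patch = np.array([
    [10,  50,  90, 130],
    [20,  60, 100, 140],
    [30,  70, 110, 150],
], dtype=np.uint8)

# "Processing" the image means manipulating those numbers, e.g. brightening it.
brighter = np.clip(patch.astype(int) + 60, 0, 255).astype(np.uint8)
print(brighter)
```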
Ethical, philosophical, and practical implications
- Data privacy and security:
- FERPA and related privacy protections when handling student records and identifying information.
- Data minimization and secure storage/transfer practices to prevent leakage of sensitive attributes (e.g., student IDs, GPAs, interests).
- Equity and bias considerations:
- Imbalanced data can bias recommendations toward overrepresented groups or domains (e.g., CS courses).
- Continuous auditing for fairness and accessibility across different student populations.
- Transparency and explainability:
- Stakeholders need understandable explanations for why certain courses are recommended; consider model-interpretability practices.
- Compliance and licensing:
- Ensure data sources and tools comply with licenses and terms of use; budget for data access costs.
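A very rough auditing sketch for the imbalance concern above, assuming a hypothetical log of served recommendations tagged by course domain (no student identifiers kept):

```python
import pandas as pd

# Hypothetical log of recommendations served by the deployed system.
log = pd.DataFrame({
    "recommended_course": ["Applied machine learning", "Cloud computing", "Database systems",
                           "Applied machine learning", "Cloud computing", "Intro to genetics"],
    "course_domain": ["CS", "CS", "CS", "CS", "CS", "Biology"],
})

# Share of recommendations per domain: a heavy skew toward one domain is a red flag.
share = log["course_domain"].value_counts(normalize=True)
print(share)
print("Audit flag: possible domain imbalance" if share.max() > 0.8 else "Roughly balanced")
```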
Summary of key concepts and takeaways
- The two main criteria for course recommendations: student strengths and current market demand.
- The ML workflow is universal and cyclical; problem definition is foundational and non-negotiable.
- Data collection requires careful consideration of data sources, privacy (FERPA), and licensing costs.
- Data processing emphasizes data quality: missing values, inconsistent formats, and potential bias.
- Data splitting typically follows a train/validation/test scheme (commonly around 70%/15%/15% or 70%/20%/10%) to evaluate generalization.
- Model selection for a recommender focuses on practical, open-source approaches rather than deep textbook classifications; prioritize a robust baseline.
- Evaluation uses diagnostic plots (e.g., training vs validation loss) and external criteria to judge usefulness to end users.
- Deployment requires cloud considerations, reliability under load, and a contingency plan.
- Updating is essential for aligning with evolving job markets and curricula; use version control to track changes.
- Real-world examples emphasize data privacy, ethical considerations, and the limits of machine understanding (e.g., interpretation of data as numbers, not semantic comprehension).
LaTeX references used in context of the notes
- Data split specification: N_{train} = 0.7\,N,\; N_{val} = \alpha N,\; N_{test} = N - N_{train} - N_{val},\quad \alpha \in [0.15, 0.25].
- Input/output example: inputs include \text{subject name}, \text{program name}, \text{list of interests}; outputs: a list of recommended courses (e.g., \text{Applied machine learning}, \text{Cloud computing}).
- Conceptual relationships: the two primary criteria for recommendations can be described qualitatively as \text{Student strengths} \quad \&\quad \text{Market demand}.