Detailed Notes: Machine Learning Workflow for a Course Recommender System
Problem framing and project overview
- Objective: Build a recommender system that suggests university courses.
- Recommendations are based on two criteria: (1) what students are good at (their caliber/strengths) and (2) current job market needs (skills in demand).
- The project is presented as a working scenario to illustrate a typical ML workflow, not as a coding exercise.
- The broader message: No matter the project, the machine learning workflow stays the same; some steps may be skipped depending on data quality and project specifics, but the core cycle remains.
- The project context and goal framing: imagine you are a data scientist given this brief; articulate the objectives and outcomes, linking them to the university's mission and to market demand.
- Core research question framing:
- Primary objectives and outcomes: why build the system? e.g., to help students apply knowledge to the market, improve employability, and support academic success.
- How the system supports the university’s mission: higher education quality, employability, and career readiness.
- How to measure success: what behaviors indicate a successful system? e.g., a generated list of courses that students adopt and find valuable.
- Potential challenges and limitations: data dependency is critical; data may be missing or insufficient; privacy and security concerns; licensing costs; storage needs.
- Stakeholders: end users (students), with faculty and advisers verifying the recommendations.
- Inputs and outputs: define what inputs look like (e.g., subject name, program name, list of interests) and the desired outputs (e.g., a list of recommended courses).
- The crucial warning about skipping steps:
- Skipping the problem-definition stage can lead to vague goals and misaligned outputs (example given: a textual description leading to outputs like course codes instead of course names).
- This stage sets the foundation for the entire workflow; without it, downstream steps become brittle.
- Practical takeaway: the problem-definition stage acts as a north star for all subsequent steps; do not skip it.
The universal machine learning workflow
- The workflow is the same across projects; there are stages that can be skipped based on data quality, but the cycle itself remains.
- The workflow consists of a cyclic sequence: define problem → collect data → preprocess data → split data → select/train model → evaluate → deploy → monitor/update → revisit earlier steps as needed.
- The workflow is iterative: you may loop back to earlier stages to improve data, adjust inputs, or change modeling approaches based on evaluation results.
Step 1: Define the problem
- What is the automation task? Define the problem in terms of inputs, outputs, and the expected behavior of the system.
- Outputs shape: What form should the system’s outputs take? (e.g., a ranked list of course recommendations with course names and reasons.)
- Important distinction: this is a proof-of-concept sketch on paper, not a fully coded system yet.
- Guiding questions (provided as a cheat sheet in the talk):
- What are the primary objectives and outcomes of the project?
- How does the system support the university’s mission (education quality and employability)?
- How do we measure success (what signals indicate a successful system)?
- What are potential challenges and limitations (data, privacy, governance, licensing, scalability)?
- Who are the stakeholders and end users (students, faculty, advisers)?
- What are the inputs and their formats, and what should be produced as output? (e.g., query with subject, program, interests → list of courses)
- Response dynamics during problem framing:
- Dialogue example shows two-way interaction with stakeholders and the instructor encouraging the student to articulate their own ideas before consulting a cheat sheet.
- Can we skip this stage? No. Skipping leads to vague requirements and misaligned outputs, as illustrated by a hypothetical scenario where outputs were not aligned with the input definitions.
- Example outputs to anchor thinking: a query with subject name, program name, and list of interests could yield a concrete list like:
- Applied machine learning, Cloud computing, …
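A rough sketch of the query-to-recommendation contract described above; the function body is a placeholder (not the eventual model), and the sample arguments are invented:

```python
from typing import List


def recommend_courses(subject: str, program: str, interests: List[str]) -> List[str]:
    """Placeholder contract: a student query in, a list of course names out."""
    # A trained recommender would rank real catalog entries here; this stub
    # just echoes the example output from the notes.
    return ["Applied machine learning", "Cloud computing"]


print(recommend_courses("Computer Science", "MS in Data Science", ["machine learning", "cloud"]))
```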
Step 2: Data collection
- Do we need to collect data?
- If public data exists, reuse it and start training quickly.
- If data is not available, identify feasible data sources.
- Potential data sources for this recommender system:
- University data: list of courses offered by the university or similar institutions; program/course catalog.
- Student data: transcripts, enrollment history, GPA, interests (collected with consent).
- Job market data: current demand signals, job postings, required skills from job platforms.
- Legal and financial considerations:
- Licensing costs for data or data-collection services.
- Data privacy and security considerations:
- Data privacy is critical (FERPA compliance for student data).
- Consideration of who can access data and how it is stored/transferred.
- The risk of re-identification or leakage of private information (e.g., student IDs, GPAs, interests).
- Data collection methods and tools:
- Web scraping as a tool to collect data from public sources (e.g., job postings, course catalog pages) when allowed; see the sketch at the end of this step.
- Surveys or consent-based collection for sensitive fields (e.g., student interests).
- Data transfer protocols and secure storage practices.
- Practical note: data collection decisions should align with privacy and security requirements and ensure that the data is legally usable for model development.
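A minimal sketch of the web-scraping route flagged above, assuming the site's terms of use and robots.txt allow it; the URL and CSS selector are hypothetical placeholders:

```python
import requests
from bs4 import BeautifulSoup

CATALOG_URL = "https://example.edu/catalog/courses"  # hypothetical course-catalog page


def fetch_course_titles(url: str = CATALOG_URL) -> list:
    """Download a catalog page and pull out course titles (selector is a placeholder)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # The tag/class below is illustrative; inspect the real page (and its robots.txt) first.
    return [node.get_text(strip=True) for node in soup.select("h3.course-title")]


if __name__ == "__main__":
    print(fetch_course_titles()[:10])
```

Consent-gated fields such as student interests would come from surveys or institutional systems rather than scraping.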
Step 3: Data processing and quality assessment
- Data quality challenges commonly encountered:
- Missing values in both student data and job market data.
- Inconsistent formats (e.g., dates in different formats across sources).
- Imbalanced data: e.g., many samples for CS courses vs. few samples for others, leading to bias in recommendations.
- Example data quality issues shown in the talk:
- Two sample tables with missing rows and inconsistent date formats.
- Potential bias: if there are far more samples for one domain (e.g., CS), the model may overfit to that domain and underrepresent others.
- The importance of data completeness and richness:
- If the dataset is not rich enough, consider engineering additional features to increase expressiveness (e.g., enriching a salary column with the median salary per role, or adding a feature such as days since a job posting was published).
- This step is optional if the data is already sufficiently complete, but it can significantly improve model learning.
- Possible actions if data is incomplete or biased:
- Collect more data, engineer additional features, or consider resampling to balance classes.
- Revisit data sources or definitions to reduce ambiguity.
- Tools and practices referenced:
- Data handling with pandas (e.g., handling missing values, normalizing date formats); see the sketch at the end of this step.
- Can this step be skipped? Generally not, because data quality directly impacts model performance and fairness. Skipping can introduce fatal errors or biased outcomes.
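A small pandas sketch of the cleaning steps referenced above (missing values, inconsistent date formats, imbalance check); the column names and toy records are invented, and format="mixed" assumes pandas 2.0+:

```python
import pandas as pd

# Toy job-market table with the kinds of issues described above:
# missing values and dates in inconsistent formats.
jobs = pd.DataFrame({
    "role": ["ML engineer", "Data analyst", "Cloud architect", None],
    "required_skill": ["machine learning", "SQL", "cloud computing", "statistics"],
    "posted": ["2024-01-15", "15/02/2024", "March 3, 2024", None],
    "salary": [120000, None, 135000, 90000],
})

# Normalize the date column to one representation; unparseable entries become NaT.
jobs["posted"] = pd.to_datetime(jobs["posted"], format="mixed", errors="coerce")

# Handle missing values: drop rows with no role, fill missing salary with the median.
jobs = jobs.dropna(subset=["role"])
jobs["salary"] = jobs["salary"].fillna(jobs["salary"].median())

# Check for imbalance across domains before training (counts per required skill).
print(jobs["required_skill"].value_counts())
```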
Step 4: Data splitting
- Purpose: to evaluate model performance on unseen data and prevent overfitting.
- Recommended data split scheme (as described in the talk):
- Training set: N_{train} = 0.7\,N
- Validation set: N_{val} = \alpha N, with \alpha \in [0.15, 0.25]
- Test set: N_{test} = N - N_{train} - N_{val} (the remaining portion)
- Rationale: training data is used to learn patterns, validation data is used to challenge learning and tune hyperparameters, and test data provides an unbiased assessment of performance on completely unseen examples.
- Note on variability: the exact split can vary (e.g., 70/15/15 or 70/20/10) depending on data size and project needs.
- Question addressed in the talk: can we skip data splitting? No, splitting is essential for a reliable evaluation and to avoid leaking information from training into testing.
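A quick sketch of a 70/15/15 split using scikit-learn's `train_test_split` applied twice; `X` and `y` are placeholder arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: N samples with a few numeric features and a label.
N = 1000
X = np.random.rand(N, 5)
y = np.random.randint(0, 2, size=N)

# First carve out 70% for training, leaving 30% for validation + test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)

# Split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```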
Step 5: Model selection and training
- The scenario focuses on generating lists of courses (recommendations) rather than a simple classification or regression task.
- Strategy for model selection:
- Start with open-source, accessible models suitable for recommenders or ranking tasks.
- Do not overcomplicate with heavyweight customization before establishing a baseline.
- Training focus: teach the model to produce a ranked list of courses given input features (e.g., subject, program, interests).
- Important caveat: this session does not dive into a deep taxonomy of classification vs. regression; the emphasis is on a practical pipeline for a recommendation task.
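One plausible open-source baseline in the spirit of this step, not a model prescribed by the talk: a content-based ranker that matches the student query against course descriptions with TF-IDF and cosine similarity. The mini-catalog is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-catalog: course name -> short description.
catalog = {
    "Applied machine learning": "supervised learning models python data pipelines",
    "Cloud computing": "cloud deployment scalability distributed systems",
    "Database systems": "sql relational databases transactions indexing",
    "Web development": "javascript frontend backend web applications",
}


def recommend(subject: str, program: str, interests: list, top_k: int = 3) -> list:
    """Rank catalog courses by TF-IDF cosine similarity to the student's query."""
    query = " ".join([subject, program] + interests)
    names = list(catalog)
    vectorizer = TfidfVectorizer()
    course_vecs = vectorizer.fit_transform(catalog.values())
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, course_vecs).ravel()
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


print(recommend("Computer Science", "MS Data Science", ["machine learning", "cloud"]))
```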
Step 6: Model evaluation
- Why evaluation is necessary:
- To ensure the model generalizes to unseen data and provides outputs that are valuable to users.
- A well-trained model that performs poorly in real use is not useful; evaluation helps decide if deployment is warranted.
- Evaluation methods illustrated:
- Use training vs. validation loss charts to monitor learning and overfitting; see the sketch at the end of this step.
- Determine whether performance on the validation/test data meets the required quality before deployment.
- If evaluation signals poor performance:
- Return to earlier steps (e.g., data collection/processing) to improve data quality.
- Consider adjusting features, collecting more data, or changing the modeling approach.
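A minimal sketch of the training-vs-validation loss chart mentioned above; the loss values are invented to show the typical overfitting pattern:

```python
import matplotlib.pyplot as plt

epochs = list(range(1, 11))
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26, 0.24, 0.22]  # keeps falling
val_loss = [0.92, 0.75, 0.62, 0.55, 0.52, 0.51, 0.52, 0.54, 0.57, 0.60]    # turns upward

plt.plot(epochs, train_loss, label="training loss")
plt.plot(epochs, val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Diverging curves after ~epoch 6 suggest overfitting")
plt.legend()
plt.show()
```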
Step 7: Deployment and web-based deployment considerations
- Deployment scenario described as a web-based application that could run in the cloud.
- Considerations for deployment:
- Where will the model be hosted and served (cloud platform choices, latency, availability)?
- How to ensure reliability and scalability as user load grows (e.g., 10k+ concurrent users).
- Backup plan (Plan B) in case of failures or outages.
- Infrastructure and tooling:
- Choice of cloud provider and deployment architecture.
- Monitoring and observability to detect performance degradation.
- Security and privacy implications continue to be important in deployment; ensure secure APIs and access controls.
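A minimal sketch of one way to expose the model as a web API, using FastAPI as an assumed (not prescribed) choice; the model call is stubbed, and authentication, rate limiting, and the Plan B fallback are omitted:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Course recommender API")


class Query(BaseModel):
    subject: str
    program: str
    interests: list[str]


@app.post("/recommend")
def recommend(query: Query) -> dict:
    # In production this would call the trained, versioned model behind access controls.
    courses = ["Applied machine learning", "Cloud computing"]  # stubbed output
    return {"courses": courses}

# Run locally with:  uvicorn main:app --reload   (assuming this file is main.py)
```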
Step 8: Monitoring, updating, and continuous improvement
- Why updates are necessary:
- Job market dynamics change over time; course offerings and student interests evolve; models can become stale if not updated.
- Update strategy considerations:
- Continuous monitoring of job market signals and student data; detect when changes warrant retraining or data collection updates.
- Version control for models and datasets to manage evolution over time.
- Version control and deployment/versioning strategies mentioned:
- Use naming conventions or GitHub to track different versions (e.g., versioned model artifacts); see the sketch at the end of this step.
- Plan for managing multiple versions of a trained model and data schemas.
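A small sketch of the naming-convention idea for versioned model artifacts; the directory layout, version string, and estimator are illustrative assumptions:

```python
from datetime import date
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

ARTIFACT_DIR = Path("artifacts")  # assumed local directory; could be object storage instead
ARTIFACT_DIR.mkdir(exist_ok=True)


def save_versioned(model, version: str) -> Path:
    """Persist a model under a name that encodes its version and training date."""
    path = ARTIFACT_DIR / f"course_recommender_v{version}_{date.today():%Y%m%d}.joblib"
    joblib.dump(model, path)
    return path


# Toy example: version whatever estimator was trained in step 5.
print(save_versioned(LogisticRegression(), version="1.2.0"))
```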
Step 9: The cycle and ongoing nature of ML projects
- The ML workflow is cyclical, not linear:
- After deployment, monitor performance, collect new data, and revisit earlier steps as needed.
- Each cycle helps ensure the system remains relevant and reliable.
- The practical takeaway: treat ML projects as iterative, living systems rather than one-off implementations.
Step 10: Related anecdotes, examples, and tangents
- Generative AI model example:
- A company wanted synonym generation for a generative AI task; because running massive models locally was impractical, they used a pre-trained model plus post-processing to get acceptable results.
- Image processing and perceptual systems (Netflix example):
- Pixels are numbers; images are processed by manipulating numerical representations.
- Netflix uses scene analysis to optimize streaming quality: when bandwidth is limited, it focuses quality on regions of interest and deprioritizes backgrounds to minimize buffering.
- Real-world parallels and learning metaphors:
- Universities use exams to assess performance; similarly, ML models need evaluation to ensure they have learned to perform on unseen data.
- Practical caution: many of these ideas require careful consideration of data privacy and ethical implications, especially when handling student data or job market data.
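To ground the "pixels are numbers" remark above, a tiny NumPy sketch with a made-up grayscale patch:

```python
import numpy as np

# A 3x4 grayscale "image" is just an array of intensities (0 = black, 255 = white).
patch = np.array([
    [10,  50,  90, 130],
    [20,  60, 100, 140],
    [30,  70, 110, 150],
], dtype=np.uint8)

# "Processing" the image means manipulating those numbers, e.g. brightening it.
brighter = np.clip(patch.astype(int) + 60, 0, 255).astype(np.uint8)
print(brighter)
```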
Ethical, philosophical, and practical implications
- Data privacy and security:
- FERPA and related privacy protections when handling student records and identifying information.
- Data minimization and secure storage/transfer practices to prevent leakage of sensitive attributes (e.g., student IDs, GPAs, interests).
- Equity and bias considerations:
- Imbalanced data can bias recommendations toward overrepresented groups or domains (e.g., CS courses).
- Continuous auditing for fairness and accessibility across different student populations.
- Transparency and explainability:
- Stakeholders need understandable explanations for why certain courses are recommended; consider model-interpretability practices.
- Compliance and licensing:
- Ensure data sources and tools comply with licenses and terms of use; budget for data access costs.
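A very rough auditing sketch for the imbalance concern above, assuming a hypothetical log of served recommendations tagged by course domain (no student identifiers kept):

```python
import pandas as pd

# Hypothetical log of recommendations served by the deployed system.
log = pd.DataFrame({
    "recommended_course": ["Applied machine learning", "Cloud computing", "Database systems",
                           "Applied machine learning", "Cloud computing", "Intro to genetics"],
    "course_domain": ["CS", "CS", "CS", "CS", "CS", "Biology"],
})

# Share of recommendations per domain: a heavy skew toward one domain is a red flag.
share = log["course_domain"].value_counts(normalize=True)
print(share)
print("Audit flag: possible domain imbalance" if share.max() > 0.8 else "Roughly balanced")
```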
Summary of key concepts and takeaways
- The two main criteria for course recommendations: student strengths and current market demand.
- The ML workflow is universal and cyclical; problem definition is foundational and non-negotiable.
- Data collection requires careful consideration of data sources, privacy (FERPA), and licensing costs.
- Data processing emphasizes data quality: missing values, inconsistent formats, and potential bias.
- Data splitting typically follows a train/validation/test scheme (commonly around 70%/15%/15% or 70%/20%/10%) to evaluate generalization.
- Model selection for a recommender focuses on practical, open-source approaches rather than deep textbook classifications; prioritize a robust baseline.
- Evaluation uses diagnostic plots (e.g., training vs validation loss) and external criteria to judge usefulness to end users.
- Deployment requires cloud considerations, reliability under load, and a contingency plan.
- Updating is essential for aligning with evolving job markets and curricula; use version control to track changes.
- Real-world examples emphasize data privacy, ethical considerations, and the limits of machine understanding (e.g., interpretation of data as numbers, not semantic comprehension).
LaTeX references used in context of the notes
- Data split specification: N_{train} = 0.7\,N,\; N_{val} = \alpha N,\; N_{test} = N - N_{train} - N_{val},\quad \alpha \in [0.15, 0.25].
- Input/output example: inputs include \text{subject name}, \text{program name}, \text{list of interests}; outputs: a list of recommended courses (e.g., \text{Applied machine learning}, \text{Cloud computing}).
- Conceptual relationships: the two primary criteria for recommendations can be described qualitatively as \text{Student strengths} \quad \&\quad \text{Market demand}.