
Detailed Notes: Machine Learning Workflow for a Course Recommender System

Problem framing and project overview

  • Objective: Build a recommender system that suggests university courses.
    • Recommendations are based on two criteria: (1) what students are good at (their caliber/strengths) and (2) current job market needs (skills in demand).
    • The project is presented as a working scenario to illustrate a typical ML workflow, not as a coding exercise.
  • The broader message: No matter the project, the machine learning workflow stays the same; some steps may be skipped depending on data quality and project specifics, but the core cycle remains.
  • The project context and goal framing: imagine you are a data scientist given this brief; articulate objectives and outcomes, linking them to university missions and market demand.
  • Core research question framing:
    • Primary objectives and outcomes: why build the system? e.g., to help students apply knowledge to the market, improve employability, and support academic success.
    • How the system supports the university’s mission: higher education quality, employability, and career readiness.
    • How to measure success: what behaviors indicate a successful system? e.g., a generated list of courses that students adopt and find valuable.
    • Potential challenges and limitations: data dependency is critical; data may be missing or insufficient; privacy and security concerns; licensing costs; storage needs.
    • Stakeholders: end users (students), with faculty and advisers verifying and validating the recommendations.
    • Inputs and outputs: define what inputs look like (e.g., subject name, program name, list of interests) and the desired outputs (e.g., a list of recommended courses).
  • The crucial warning about skipping steps:
    • Skipping the problem-definition stage can lead to vague goals and misaligned outputs (example given: a textual description leading to outputs like course codes instead of course names).
    • This stage sets the foundation for the entire workflow; without it, downstream steps become brittle.
  • Practical takeaway: the problem-definition stage acts as a north star for all subsequent steps; do not skip it.

The universal machine learning workflow

  • The workflow is the same across projects; there are stages that can be skipped based on data quality, but the cycle itself remains.
  • The workflow consists of a cyclic sequence: define problem → collect data → preprocess data → split data → select/train model → evaluate → deploy → monitor/update → revisit earlier steps as needed.
  • The workflow is iterative: you may loop back to earlier stages to improve data, adjust inputs, or change modeling approaches based on evaluation results.

Step 1: Define the problem

  • What is the automation task? Define the problem in terms of inputs, outputs, and the expected behavior of the system.
  • Outputs shape: What form should the system’s outputs take? (e.g., a ranked list of course recommendations with course names and reasons.)
  • Important distinction: this is a proof-of-concept sketch on paper, not a fully coded system yet.
  • Guiding questions (provided as a cheat sheet in the talk):
    • What are the primary objectives and outcomes of the project?
    • How does the system support the university’s mission (education quality and employability)?
    • How do we measure success (what signals indicate a successful system)?
    • What are potential challenges and limitations (data, privacy, governance, licensing, scalability)?
    • Who are the stakeholders and end users (students, faculty, advisers)?
    • What are the inputs and their formats, and what should be produced as output? (e.g., query with subject, program, interests → list of courses)
  • Response dynamics during problem framing:
    • Dialogue example shows two-way interaction with stakeholders and the instructor encouraging the student to articulate their own ideas before consulting a cheat sheet.
  • Can we skip this stage? No. Skipping leads to vague requirements and misaligned outputs, as illustrated by a hypothetical scenario where outputs were not aligned with the input definitions.
  • Example outputs to anchor thinking: a query with subject name, program name, and list of interests could yield a concrete list like:
    • Applied machine learning, Cloud computing, …
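
To make the problem definition concrete, here is a minimal sketch of how the inputs and outputs above could be written down as data structures during this paper-only stage; the class and field names (RecommendationRequest, Recommendation, score) are illustrative assumptions, not something specified in the talk.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecommendationRequest:
    """Input described in the notes: subject, program, and student interests."""
    subject_name: str
    program_name: str
    interests: List[str]

@dataclass
class Recommendation:
    """Output: a course name plus a short reason, ranked by relevance."""
    course_name: str
    reason: str
    score: float

# Example of the desired behavior sketched at problem-definition time:
request = RecommendationRequest(
    subject_name="Computer Science",
    program_name="BSc Computer Science",
    interests=["machine learning", "cloud systems"],
)
expected_output = [
    Recommendation("Applied machine learning", "matches stated interests", 0.92),
    Recommendation("Cloud computing", "high current market demand", 0.87),
]
```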

Step 2: Data collection

  • Do we need to collect data?
    • If public data exists, reuse it and start training quickly.
    • If data is not available, identify feasible data sources.
  • Potential data sources for this recommender system:
    • University data: list of courses offered by the university or similar institutions; program/course catalog.
    • Student data: transcripts, enrollment history, GPA, interests (collected with consent).
    • Job market data: current demand signals, job postings, required skills from job platforms.
  • Legal and financial considerations:
    • Licensing costs for data or data-collection services.
  • Data privacy and security considerations:
    • Data privacy is critical (FERPA compliance for student data).
    • Consideration of who can access data and how it is stored/transferred.
    • The risk of re-identification or leakage of private information (e.g., student IDs, GPAs, interests).
  • Data collection methods and tools:
    • Web scraping as a tool to collect data from public sources (e.g., job postings, course catalog pages) when allowed.
    • Surveys or consent-based collection for sensitive fields (e.g., student interests).
    • Data transfer protocols and secure storage practices.
  • Practical note: data collection decisions should align with privacy and security requirements and ensure that the data is legally usable for model development.
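
As one illustration of the collection methods above, the sketch below pulls course titles from a public catalog page with requests and BeautifulSoup, assuming the page's terms of use permit scraping; the URL and CSS selector are placeholders, not real endpoints.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

CATALOG_URL = "https://example.edu/catalog/courses"  # placeholder URL

def fetch_course_catalog(url: str = CATALOG_URL) -> pd.DataFrame:
    """Download a public catalog page and extract course titles into a DataFrame."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumes each course title sits in an element with class "course-title".
    titles = [el.get_text(strip=True) for el in soup.select(".course-title")]
    return pd.DataFrame({"course_name": titles})

# Sensitive student fields (interests, GPA, transcripts) would instead come from
# consent-based surveys or institutional systems, never from scraping.
```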

Step 3: Data processing and quality assessment

  • Data quality challenges commonly encountered:
    • Missing values in both student data and job market data.
    • Inconsistent formats (e.g., dates in different formats across sources).
    • Imbalanced data: e.g., many samples for CS courses vs. few samples for others, leading to bias in recommendations.
  • Example data quality issues shown in the talk:
    • Two sample tables with missing rows and inconsistent date formats.
    • Potential bias: if there are far more samples for one domain (e.g., CS), the model may overfit to that domain and underrepresent others.
  • The importance of data completeness and richness:
    • If the dataset is not rich enough, consider adding features to increase expressiveness (e.g., a salary column could be enriched with a median salary per role; adding a feature like days since a job post was posted).
    • This step is optional if the data is already sufficiently complete, but it can significantly improve model learning.
  • Possible actions if data is incomplete or biased:
    • Collect more data, engineer additional features, or consider resampling to balance classes.
    • Revisit data sources or definitions to reduce ambiguity.
  • Tools and practices referenced:
    • Data handling with pandas (e.g., handling missing values, normalizing formats).
  • Can this step be skipped? Generally not, because data quality directly impacts model performance and fairness. Skipping can introduce fatal errors or biased outcomes.
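
A minimal pandas sketch of the cleaning steps described above (missing values, inconsistent date formats, and the days-since-posted feature); the column names and values are made up, and the mixed-format date parsing assumes pandas 2.x.

```python
import pandas as pd

# Hypothetical raw job-market table; columns and values are illustrative only.
jobs = pd.DataFrame({
    "role": ["ML Engineer", "Data Analyst", None],
    "posted_date": ["2024-01-15", "02/15/2024", "March 1, 2024"],  # mixed formats
    "salary": [120000, None, 95000],
})

# Missing values: drop rows with no role, fill missing salaries with the median.
jobs = jobs.dropna(subset=["role"])
jobs["salary"] = jobs["salary"].fillna(jobs["salary"].median())

# Inconsistent formats: normalize every date into one datetime column
# (format="mixed" requires pandas 2.x).
jobs["posted_date"] = pd.to_datetime(jobs["posted_date"], format="mixed")

# Feature engineering mentioned in the notes: days since the posting appeared.
jobs["days_since_posted"] = (pd.Timestamp.today() - jobs["posted_date"]).dt.days

# An imbalance check (e.g., samples per course domain) would look like:
# courses["domain"].value_counts(normalize=True)
```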

Step 4: Data splitting

  • Purpose: to evaluate model performance on unseen data and prevent overfitting.
  • Recommended data split scheme (as described in the talk):
    • Training set: N_{train} = 0.7N
    • Validation set: N_{val} between 0.15N and 0.25N
    • Test set: N_{test} = N - N_{train} - N_{val} (the remaining portion)
  • Rationale: training data is used to learn patterns, validation data is used to challenge learning and tune hyperparameters, and test data provides an unbiased assessment of performance on completely unseen examples.
  • Note on variability: the exact split can vary (e.g., 70/15/15 or 70/20/10) depending on data size and project needs.
  • Question addressed in the talk: can we skip data splitting? No, splitting is essential for a reliable evaluation and to avoid leaking information from training into testing.
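
A minimal sketch of the 70/15/15 split described above, using scikit-learn's train_test_split applied twice; X and y are random stand-ins for the real student/job-market features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels standing in for the real data.
X = np.random.rand(1000, 8)
y = np.random.randint(0, 5, size=1000)

# 70% train, then split the remaining 30% in half: 15% validation, 15% test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```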

Step 5: Model selection and training

  • The scenario focuses on generating lists of courses (recommendations) rather than a simple classification or regression task.
  • Strategy for model selection:
    • Start with open-source, accessible models suitable for recommenders or ranking tasks.
    • Do not overcomplicate with heavyweight customization before establishing a baseline.
  • Training focus: teach the model to produce a ranked list of courses given input features (e.g., subject, program, interests).
  • Important caveat: this session does not dive into deep taxonomy of classification vs. regression; the emphasis is on a practical pipeline for a recommendation task.
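
As a hedged example of the "start simple" strategy, the sketch below ranks courses against a student's interests with TF-IDF and cosine similarity; the course descriptions and query are placeholders, and this is one possible open-source baseline rather than the approach prescribed in the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical course descriptions; in practice these would come from the catalog.
courses = {
    "Applied machine learning": "supervised learning models python data",
    "Cloud computing": "aws distributed systems deployment scalability",
    "Art history": "renaissance painting sculpture museums",
}

def recommend(interests: str, top_k: int = 2) -> list:
    """Rank courses by similarity between stated interests and course descriptions."""
    names = list(courses)
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([interests] + list(courses.values()))
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

print(recommend("machine learning and cloud systems"))
```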

Step 6: Model evaluation

  • Why evaluation is necessary:
    • To ensure the model generalizes to unseen data and provides outputs that are valuable to users.
    • A well-trained model that performs poorly in real use is not useful; evaluation helps decide if deployment is warranted.
  • Evaluation methods illustrated:
    • Use training vs. validation loss charts to monitor learning and overfitting.
    • Determine whether performance on the validation/test data meets the required quality before deployment.
  • If evaluation signals poor performance:
    • Return to earlier steps (e.g., data collection/processing) to improve data quality.
    • Consider adjusting features, collecting more data, or changing the modeling approach.
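
A small illustration of the loss-chart diagnostic described above, using made-up per-epoch values: training loss keeps falling while validation loss turns upward, the classic overfitting signal that sends you back to earlier steps.

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) per-epoch losses logged during training.
train_loss = [0.90, 0.62, 0.45, 0.33, 0.25, 0.20, 0.17, 0.15]
val_loss   = [0.95, 0.70, 0.55, 0.48, 0.46, 0.47, 0.50, 0.55]

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Validation loss rising after epoch ~4 suggests overfitting")
plt.show()

# Rule of thumb: if validation loss rises for several epochs while training loss
# keeps falling, revisit data collection/processing or the model before deploying.
```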

Step 7: Deployment and web-based deployment considerations

  • Deployment scenario described as a web-based application that could run in the cloud.
  • Considerations for deployment:
    • Where will the model be hosted and served (cloud platform choices, latency, availability)?
    • How to ensure reliability and scalability as user load grows (e.g., 10k+ concurrent users).
    • Backup plan (Plan B) in case of failures or outages.
  • Infrastructure and tooling:
    • Choice of cloud provider and deployment architecture.
    • Monitoring and observability to detect performance degradation.
  • Security and privacy implications continue to be important in deployment; ensure secure APIs and access controls.
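
A minimal sketch of what the web-based serving layer might look like, assuming FastAPI and uvicorn; the endpoint name, schema, file name (app.py), and hard-coded response are placeholders standing in for the trained model.

```python
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    subject_name: str
    program_name: str
    interests: List[str]

@app.post("/recommend")
def recommend(query: Query) -> dict:
    # In a real deployment the trained model would be loaded once at startup
    # and invoked here; this stub returns a fixed list for illustration.
    return {"courses": ["Applied machine learning", "Cloud computing"]}

# Run locally with:  uvicorn app:app --host 0.0.0.0 --port 8000
# Behind a cloud load balancer, multiple replicas of this service can absorb
# spikes in concurrent users, with monitoring on latency and error rates.
```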

Step 8: Monitoring, updating, and continuous improvement

  • Why updates are necessary:
    • Job market dynamics change over time; course offerings and student interests evolve; models can become stale if not updated.
  • Update strategy considerations:
    • Continuous monitoring of job market signals and student data; detect when changes warrant retraining or data collection updates.
    • Version control for models and datasets to manage evolution over time.
  • Version control and deployment/versioning strategies mentioned:
    • Use naming conventions or GitHub to track different versions (e.g., versioned model artifacts).
    • Plan for managing multiple versions of a trained model and data schemas.
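
One possible (assumed, not prescribed) convention for versioning trained artifacts alongside a dataset identifier, using joblib and a small metadata file, so that models can be traced and rolled back.

```python
import json
from datetime import date
from pathlib import Path

import joblib

def save_model_version(model, version: str, data_snapshot: str) -> Path:
    """Persist a model plus metadata; directory layout and naming are assumptions."""
    out_dir = Path("models") / f"recommender-{version}"
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "model.joblib")
    metadata = {
        "version": version,
        "trained_on": date.today().isoformat(),
        "data_snapshot": data_snapshot,  # e.g., a Git tag or dataset hash
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return out_dir

# Example: save_model_version(trained_model, "v2.1.0", "jobs-2024-06")
```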

Step 9: The cycle and ongoing nature of ML projects

  • The ML workflow is cyclical, not linear:
    • After deployment, monitor performance, collect new data, and revisit earlier steps as needed.
    • Each cycle helps ensure the system remains relevant and reliable.
  • The practical takeaway: treat ML projects as iterative, living systems rather than one-off implementations.

Step 10: Related anecdotes, examples, and tangents

  • Generative AI model example:
    • A company needed synonym generation for a generative AI task; because running massive models locally was impractical, it used a pre-trained model with post-processing to get results.
  • Image processing and perceptual systems (Netflix example):
    • Pixels are numbers; images are processed by manipulating numerical representations.
    • Netflix uses scene analysis to optimize streaming quality: if bandwidth is limited, it optimizes what to display to minimize buffering by focusing on regions of interest (e.g., backgrounds may be deprioritized).
  • Real-world parallels and learning metaphors:
    • Universities use exams to assess performance; similarly, ML models need evaluation to ensure they have learned to perform on unseen data.
  • Practical caution: many of these ideas require careful consideration of data privacy and ethical implications, especially when handling student data or job market data.

Ethical, philosophical, and practical implications

  • Data privacy and security:
    • FERPA and related privacy protections when handling student records and identifying information.
    • Data minimization and secure storage/transfer practices to prevent leakage of sensitive attributes (e.g., student IDs, GPAs, interests).
  • Equity and bias considerations:
    • Imbalanced data can bias recommendations toward overrepresented groups or domains (e.g., CS courses).
    • Continuous auditing for fairness and accessibility across different student populations.
  • Transparency and explainability:
    • Stakeholders need understandable explanations for why certain courses are recommended; consider model-interpretability practices.
  • Compliance and licensing:
    • Ensure data sources and tools comply with licenses and terms of use; budget for data access costs.

Summary of key concepts and takeaways

  • The two main criteria for course recommendations: student strengths and current market demand.
  • The ML workflow is universal and cyclical; problem definition is foundational and non-negotiable.
  • Data collection requires careful consideration of data sources, privacy (FERPA), and licensing costs.
  • Data processing emphasizes data quality: missing values, inconsistent formats, and potential bias.
  • Data splitting typically follows a train/validation/test scheme (commonly around 70% train, 15-25% validation, and the remainder for test) to evaluate generalization.
  • Model selection for a recommender focuses on practical, open-source approaches rather than deep textbook classifications; prioritize a robust baseline.
  • Evaluation uses diagnostic plots (e.g., training vs validation loss) and external criteria to judge usefulness to end users.
  • Deployment requires cloud considerations, reliability under load, and a contingency plan.
  • Updating is essential for aligning with evolving job markets and curricula; use version control to track changes.
  • Real-world examples emphasize data privacy, ethical considerations, and the limits of machine understanding (e.g., interpretation of data as numbers, not semantic comprehension).

LaTeX references used in context of the notes

  • Data split specification: N_{train} = 0.7N,\; N_{val} = \alpha N,\; N_{test} = N - N_{train} - N_{val},\quad \alpha \in [0.15, 0.25].
  • Input/output example: inputs include \text{subject name},\text{program name},\text{list of interests}; outputs: a list of recommended courses (e.g., \text{Applied machine learning},\, \text{Cloud computing}).
  • Conceptual relationships: the two primary criteria for recommendations can be described qualitatively as \text{Student strengths} \quad \&\quad \text{Market demand}.