L20: ML Operations

The ML Project Lifecycle

The ML project lifecycle encompasses several key stages, from understanding the problem to deploying and monitoring models. This iterative process ensures that machine learning projects are effective and aligned with business goals. The lifecycle also includes continuous evaluation and refinement to adapt to changing data and business needs.

Scoping and Feasibility
Understanding the Problem
  • Begins with understanding the context and the big picture. This involves engaging with stakeholders, conducting preliminary research, and documenting existing pain points.

  • Involves identifying existing systems and processes. Documenting the current infrastructure, workflows, and technologies in use.

  • Determines what problems can and should be solved using ML. Not all problems are suitable for ML solutions; this step assesses the feasibility and potential impact of using ML.

Defining Goals and Metrics
  • Clearly defines the objectives of the ML project. Objectives should be specific, measurable, achievable, relevant, and time-bound (SMART).

  • Establishes metrics to measure success. Key performance indicators (KPIs) should be defined to track progress and evaluate the effectiveness of the ML project. Examples include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
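
To make the metrics concrete, here is a minimal sketch of computing the KPIs mentioned above with scikit-learn, assuming a binary classification task; the label and score arrays are placeholder data.

    # Minimal sketch: computing common classification KPIs with scikit-learn.
    # Assumes binary classification; y_true, y_pred and y_score are placeholder data.
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
    y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard predictions from the model
    y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_true, y_score))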

Data
Data Acquisition
  • Data ingestion from various sources. This includes databases, data lakes, APIs, and external datasets. Ensuring compatibility and accessibility of data sources is essential.

  • Data verification to ensure quality and accuracy. Implementing data validation checks to identify and correct errors, inconsistencies, and missing values (a minimal example is sketched after this list).

  • Labeling data appropriately for supervised learning. This involves assigning labels or annotations to data points to train supervised learning models. Labeling can be manual, automated, or a combination of both.
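
As a rough illustration of the data verification step, the sketch below runs a few basic checks with pandas; the columns and values are hypothetical.

    # Minimal sketch of basic data-verification checks with pandas.
    # The "age" and "income" columns and their values are hypothetical.
    import pandas as pd

    df = pd.DataFrame({
        "age":    [25, 34, -1, 47, None],
        "income": [52000, 61000, 58000, None, 43000],
    })

    print(df.isnull().sum())          # missing values per column
    print(df.duplicated().sum())      # number of duplicated rows
    print(df.describe())              # summary statistics to spot outliers

    # Simple consistency rule: ages must be plausible.
    invalid_age = df[(df["age"] < 0) | (df["age"] > 120)]
    print("rows with implausible age:", len(invalid_age))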

Data Preparation
  • Data exploration to understand its characteristics. Exploratory data analysis (EDA) techniques are used to visualize and summarize data, identify patterns, and gain insights.

  • Feature engineering to create relevant input features. Feature engineering involves transforming raw data into features that are suitable for model training. This may include scaling, normalization, encoding categorical variables, and creating new features based on domain knowledge.

  • Preparing data for model training. This includes splitting the data into training, validation, and test sets, as well as applying any necessary preprocessing steps.
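
A minimal sketch of the splitting and preprocessing step with scikit-learn follows; the synthetic dataset and the 60/20/20 split ratio are illustrative choices, not requirements.

    # Minimal sketch: splitting data into train/validation/test sets and scaling features.
    # The synthetic data and the 60/20/20 ratio are only examples.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

    scaler = StandardScaler().fit(X_train)   # fit only on training data to avoid leakage
    X_train = scaler.transform(X_train)
    X_val   = scaler.transform(X_val)
    X_test  = scaler.transform(X_test)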

Modelling
Model Exploration
  • Exploring different ML models. This involves experimenting with various algorithms and architectures to find the most suitable model for the problem at hand.

  • Fine-tuning models to optimize performance. Hyperparameter tuning techniques are used to optimize the performance of ML models. This may include grid search, random search, and Bayesian optimization.

  • Constructing ensembles of models to improve accuracy. Ensemble methods combine multiple models to improve predictive performance. Common ensemble techniques include bagging, boosting, and stacking.
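
The sketch below combines the last two ideas: grid-search hyperparameter tuning over a random forest (a bagging-style ensemble), using scikit-learn. The parameter grid and dataset are illustrative only.

    # Minimal sketch: hyperparameter tuning of an ensemble model with scikit-learn.
    # The parameter grid and synthetic dataset are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring="f1")
    search.fit(X, y)

    print("best params:", search.best_params_)
    print("best CV F1 :", search.best_score_)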

Deployment and Monitoring
Evaluation and Validation
  • Evaluating model performance using appropriate metrics. Selecting the right evaluation metrics is crucial for assessing the performance of ML models. Metrics should align with the project's goals and objectives.

  • Validating model accuracy and reliability. This involves assessing the generalization performance of the model on unseen data. Techniques such as cross-validation and holdout validation are used to estimate model performance (see the sketch after this list).

  • Analyzing mistakes to identify areas for improvement. Error analysis examines misclassified or poorly predicted examples to uncover systematic weaknesses and guide further data collection or feature engineering.

  • Investigating feature importance to understand model behavior. Feature importance analysis helps to identify the most influential features in the model. This information can be used to gain insights into the underlying relationships in the data and guide feature engineering efforts.
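
A minimal sketch of cross-validation and feature-importance inspection with scikit-learn follows; the logistic regression model and synthetic data are placeholders.

    # Minimal sketch: estimating generalization with cross-validation and inspecting
    # feature importance via permutation importance. Model and data are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print("cross-validated accuracy:", scores.mean())

    model.fit(X_train, y_train)
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    print("importance per feature:", result.importances_mean)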

Stakeholder Presentation
  • Presenting the model and its results to stakeholders. This involves communicating the key findings, insights, and limitations of the ML project in a clear and concise manner.

Deployment
  • Deploying the model to a production environment. This involves integrating the model into the existing infrastructure and making it available for real-time predictions.
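
Deployment setups vary widely; as one common pattern (among many), the sketch below wraps a model in a small Flask prediction service. The /predict route is hypothetical, and a real deployment would load a persisted model artifact and add input validation, authentication, and logging.

    # Minimal sketch: serving a model behind a small HTTP endpoint with Flask.
    # The /predict route is hypothetical; the inline model is a stand-in for a
    # trained artifact loaded from storage.
    from flask import Flask, jsonify, request
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    app = Flask(__name__)

    # Stand-in model trained on synthetic data; in practice, load the trained artifact.
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]      # expects {"features": [[...], ...]}
        return jsonify({"predictions": model.predict(features).tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)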

Monitoring and Updates
  • Continuously monitoring model performance. Monitoring model performance in production is essential for detecting and addressing issues such as concept drift and data quality degradation.

  • Updating models and data to maintain accuracy. Regularly updating models with new data helps to maintain accuracy and relevance over time. This may involve retraining models from scratch or fine-tuning existing models with new data.

Business Impact Monitoring
  • Monitoring the impact of the model on business outcomes. This involves tracking key business metrics to assess the impact of the ML model on business objectives. Examples include revenue, customer satisfaction, and operational efficiency.

Version Control
  • Implementing version control for models and data. Version control systems (e.g., Git) are used to track changes to models, data, and code. This enables collaboration, reproducibility, and rollback to previous versions.

Performance Measures

Metrics to evaluate the performance of ML models include:

  • Quality: Accuracy and reliability of the model.

  • Latency and Throughput: Speed and volume of predictions.

  • Development and Maintenance Time: Time and effort required to develop and maintain the model.

  • Usage Cost: Cost associated with using the model.

  • Compliance: Adherence to relevant regulations and standards.

Tools like TensorFlow Serving and Grafana dashboards can be used to monitor performance.
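
Grafana dashboards typically visualize metrics scraped by Prometheus. As a rough sketch of how latency and throughput can be exposed for such a dashboard (not tied to TensorFlow Serving), the example below uses the Python prometheus_client library; the metric names and the simulated inference are hypothetical.

    # Minimal sketch: exposing prediction latency and throughput as Prometheus metrics
    # that a Grafana dashboard could visualize. Metric names are hypothetical.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
    LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

    @LATENCY.time()
    def predict(features):
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real model inference
        PREDICTIONS.inc()
        return 0

    if __name__ == "__main__":
        start_http_server(8000)                  # metrics exposed at :8000/metrics
        while True:
            predict([1.0, 2.0, 3.0])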

Concept Drift

A typical problem in ML is concept drift, where changes in the underlying data distribution cause a deployed model's performance to degrade over time. For example:

  • New email spam strategies can bypass ML-based spam filters.

  • New trends can disrupt shopping recommender systems.

Detecting drift can be difficult, since it relies on continuously testing the model against new data, and ground-truth labels for that data often arrive late or not at all.
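
One label-free approach is to compare the distribution of an input feature between a reference window (e.g., training data) and recent production data. The sketch below does this with a two-sample Kolmogorov-Smirnov test from SciPy; the 0.05 threshold and the synthetic data are illustrative.

    # Minimal sketch: flagging possible drift in one input feature by comparing a
    # reference window against recent data with a two-sample KS test.
    # The 0.05 significance threshold is illustrative, not a universal rule.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature values at training time
    recent = rng.normal(loc=0.5, scale=1.0, size=5000)      # feature values in production

    statistic, p_value = ks_2samp(reference, recent)
    if p_value < 0.05:
        print("possible drift detected (p = %.4f)" % p_value)
    else:
        print("no significant drift detected (p = %.4f)" % p_value)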

Static vs. Dynamic Training
Static (Offline) Training
  • Train a single model once, deploy it, and use it for a while.

  • Pros: Model can be thoroughly tested, and deployment only needs to be done once.

  • Cons: Assumes the data distribution stays stable over time; performance degrades if the data drifts.

Dynamic (Online) Training
  • Retrain models continuously when new data comes in, and serve the most recent model.

  • Pros: Model is always up-to-date with shifts/fluctuations in data.

  • Cons: Potentially high cost of training and deployment cycle, and each model may not be thoroughly tested.
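
A minimal sketch of the dynamic approach using scikit-learn's partial_fit on an SGDClassifier follows; the batch loop simulates labeled data arriving over time with a slowly shifting decision boundary.

    # Minimal sketch: incremental (online) training with scikit-learn's partial_fit,
    # simulating labeled data that arrives in batches and slowly drifts.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    model = SGDClassifier()
    classes = np.array([0, 1])                      # must be declared on the first call

    for step in range(10):                          # each iteration = a new batch of data
        X_batch = rng.normal(size=(100, 5))
        y_batch = (X_batch[:, 0] + 0.1 * step > 0).astype(int)   # drifting boundary
        model.partial_fit(X_batch, y_batch, classes=classes)
        print("step", step, "accuracy on batch:", model.score(X_batch, y_batch))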

ML in Real Life (ML IRL)

ML in real-world applications involves various components:

  • Data Collection: Gathering data from various sources.

  • Feature Extraction: Extracting relevant features from the data.

  • ML Code: Implementing machine learning algorithms.

  • Serving: Deploying the model to make predictions.

  • Monitoring: Tracking the model's performance.

  • Analysis Tools: Tools for analyzing data and model performance.

  • Data Verification: Ensuring data quality.

  • Configuration: Configuring the ML system.

  • Machine Resource Management: Managing computational resources.

  • Process Management Tools: Tools for managing the ML process.

  • Infrastructure: The underlying infrastructure for ML.

MLOps: Machine Learning Operations

MLOps adapts Continuous Integration (CI) and Continuous Delivery (CD) from DevOps to include data processing and adds Continuous Training (CT).

There are no hard rules on how to organize an ML workflow; MLOps is a widely adopted set of practices rather than a rigid standard.

MLOps Levels
Level 0: Manual Processes
  • All processes are done manually.

  • The starting point for any ML project.

  • Data preparation, model training, validation, etc., are performed by hand.

  • Most code is experimental.

  • Model development and deployment are completely separate.

  • Infrequent release iterations.

Level 1: Pipeline Automation
  • Set up automated processes for data preparation and model training.

  • Automatically retrain models when new data is available (CT).

  • Run standardized performance tests.

  • Deploy new model versions.

  • At level 0, a final model is deployed, while at level 1, an entire training pipeline is deployed.
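
As a very rough sketch of what "deploying a pipeline rather than a model" can mean, the function below could be triggered whenever new labeled data lands: it retrains, runs a standardized validation gate, and promotes the new model only if the gate passes. The 0.9 threshold and "model.pkl" path are hypothetical, and real setups usually run such steps under an orchestrator (e.g., Airflow or Kubeflow Pipelines).

    # Very rough sketch of a continuous-training (CT) step: retrain on fresh data,
    # run a standardized performance check, and promote the model only if it passes.
    # The 0.9 threshold and "model.pkl" path are hypothetical.
    import pickle

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def retrain_and_maybe_promote(X, y, threshold=0.9, model_path="model.pkl"):
        X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
        model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
        score = accuracy_score(y_val, model.predict(X_val))
        if score >= threshold:                       # standardized performance gate
            with open(model_path, "wb") as f:
                pickle.dump(model, f)
            return "promoted", score
        return "rejected", score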

Level 2: CI/CD Pipeline Automation
  • Allows new model development to be integrated quickly.

  • Requires more support structure:

  • Integration testing

  • Additional model metrics (latency, throughput, etc.)

  • Model metadata store

  • Advanced pipeline orchestration