Reduce the costs of ML workflows with preemptible VMs and GPUs

Preemptible VMs for Cost Reduction

Preemptible VMs are Compute Engine instances that last at most 24 hours and provide no availability guarantees, but are priced significantly lower than standard VMs[1]. On Google Kubernetes Engine (GKE), you can create clusters or node pools of preemptible VMs with GPUs attached, making them a good fit for ML workflows whose completion times are flexible[1].
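
The post creates such a node pool with a `gcloud container node-pools create ... --preemptible` command; as a rough Python equivalent, here is a minimal sketch using the google-cloud-container client library. The project, zone, cluster, machine type, and GPU type below are placeholders, not values from the post.

```python
# A minimal sketch using the google-cloud-container client library; the
# project/zone/cluster names, machine type, and GPU type are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

node_pool = container_v1.NodePool(
    name="preemptible-gpu-pool",
    initial_node_count=1,
    config=container_v1.NodeConfig(
        machine_type="n1-highmem-8",  # placeholder machine type
        preemptible=True,             # the key setting: nodes are preemptible VMs
        oauth_scopes=["https://www.googleapis.com/auth/cloud-platform"],
        accelerators=[
            container_v1.AcceleratorConfig(
                accelerator_count=1,
                accelerator_type="nvidia-tesla-t4",  # placeholder GPU type
            )
        ],
        # Taint the pool so only pipeline steps that explicitly tolerate
        # preemption (see the annotations later in this piece) land on it.
        taints=[
            container_v1.NodeTaint(
                key="preemptible",
                value="true",
                effect=container_v1.NodeTaint.Effect.NO_SCHEDULE,
            )
        ],
    ),
    # Autoscaling down to zero nodes means you pay nothing while no
    # pipeline step is running on the pool.
    autoscaling=container_v1.NodePoolAutoscaling(
        enabled=True, min_node_count=0, max_node_count=4
    ),
)

client.create_node_pool(
    parent="projects/my-project/locations/us-central1-a/clusters/my-cluster",
    node_pool=node_pool,
)
```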

Kubeflow Pipelines and Preemptible VMs

Kubeflow is an open-source project for deploying ML workflows on Kubernetes. Kubeflow Pipelines allows users to build and deploy scalable ML workflows based on Docker containers[1]. When running Kubeflow on GKE, users can now define pipeline steps to run on preemptible nodes, reducing job costs[1].
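
For concreteness, here is a minimal sketch of a container-based pipeline in the KFP v1 Python SDK; the image names and arguments are hypothetical stand-ins for your own containers, not the post's code.

```python
# A minimal two-step pipeline sketch using the KFP v1 SDK; the images and
# arguments are hypothetical placeholders.
import kfp
import kfp.dsl as dsl

@dsl.pipeline(
    name="copy-and-train",
    description="Each step runs as its own container on the GKE cluster.",
)
def copy_and_train(data_path: str = "gs://my-bucket/data"):
    copy_step = dsl.ContainerOp(
        name="copy-data",
        image="gcr.io/my-project/gcs-copy:latest",
        arguments=["--data-path", data_path],
    )
    train_step = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",
        arguments=["--data-path", data_path],
    )
    train_step.after(copy_step)  # train only after the copy completes

if __name__ == "__main__":
    # Compile to an archive you can upload via the Kubeflow Pipelines UI.
    kfp.compiler.Compiler().compile(copy_and_train, "copy_and_train.tar.gz")
```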

Implementation Guidelines

1. Setup: Create a preemptible, GPU-enabled node pool in your GKE cluster[1].

2. Pipeline Definition: Modify pipeline steps to use preemptible nodes and specify retry attempts[1] (see the annotation sketch after this list).

3. Idempotency: Ensure preemptible steps are either idempotent or can checkpoint their work to resume after interruption[1] (a resumable-training sketch follows the example in the next section).
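
In the v1 KFP SDK, the annotations for step 2 look roughly like this; `train_step` is a ContainerOp as sketched earlier, and the retry count is illustrative:

```python
# Annotate a step to run on the preemptible pool, request a GPU, and retry
# after preemption; `train_step` is a ContainerOp as sketched above.
import kfp.gcp as gcp

train_step.apply(gcp.use_preemptible_nodepool())  # toleration + node selector
train_step.set_gpu_limit(1)  # schedule onto a GPU node and request one GPU
train_step.set_retry(5)      # illustrative: re-run up to 5 times if preempted
```

Because a retry re-runs the whole step from the start, these annotations only pay off when the step also satisfies the idempotency requirement in step 3.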

Example: Reducing Model Training Costs

The post provides an example of modifying a Kubeflow pipeline that trains a Tensor2Tensor model on GitHub issue data:

1. Refactor the pipeline to separate data copying and model training steps[1].

2. Create reusable component specifications for Google Cloud Storage copy and TensorFlow training steps[1].

3. Annotate the training operation to run on a preemptible GPU-enabled node with multiple retry attempts[1].
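
As one hedged illustration of the checkpoint-and-resume idea (not the post's actual training code), a training step's entry point can restore saved state before continuing; tf.keras's BackupAndRestore callback does exactly this per epoch. The backup path is a placeholder.

```python
# One way to make a training step resumable after preemption; the backup
# directory is a placeholder GCS path, not from the post.
import tensorflow as tf

BACKUP_DIR = "gs://my-bucket/train-backup"

def train_resumable(model: tf.keras.Model,
                    dataset: tf.data.Dataset,
                    epochs: int) -> None:
    # BackupAndRestore writes training state each epoch; when the step is
    # retried after a preemption, it restores that state and continues from
    # the last completed epoch instead of starting over.
    backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir=BACKUP_DIR)
    model.fit(dataset, epochs=epochs, callbacks=[backup_cb])
```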

Monitoring and Logging

Stackdriver Logging can be used to inspect logs for both current and terminated pipeline operations[1]. The Kubeflow Pipelines dashboard UI shows preempted and restarted training steps, with links to the Stackdriver logs for detailed information[1].
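
Beyond the dashboard links, logs can also be pulled programmatically. Here is a sketch using the Cloud Logging (formerly Stackdriver) Python client; the project ID and pod-name filter are hypothetical.

```python
# Fetch recent log entries for a pipeline step's pods; the project ID and
# pod-name substring are hypothetical.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")
log_filter = (
    'resource.type="k8s_container" '
    'resource.labels.pod_name:"train"'  # substring match on the pod name
)
for entry in client.list_entries(filter_=log_filter,
                                 order_by=cloud_logging.DESCENDING):
    print(entry.timestamp, entry.payload)
```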

Benefits and Use Cases

Preemptible VMs are well-suited for:

- Regularly scheduled training or tuning jobs with flexible completion times

- Large-scale hyperparameter tuning experiments

- Any pipeline stage that can tolerate interruption and restart

By using preemptible VMs and configuring node pools to autoscale down to zero when idle, users can significantly lower costs without paying for idle instances[1].

Citations:

[1] https://cloud.google.com/blog/products/ai-machine-learning/reduce-the-costs-of-ml-workflows-with-preemptible-vms-and-gpus?hl=en
