Preemptible VMs for Cost Reduction
Preemptible VMs are Compute Engine instances that last a maximum of 24 hours and provide no availability guarantees, but are priced significantly lower than standard VMs[1]. They can be used with Google Kubernetes Engine (GKE) to set up clusters or node pools with GPUs attached, making them well suited to ML workloads that can tolerate interruption and have flexible completion times[1].
Kubeflow Pipelines and Preemptible VMs
Kubeflow is an open-source project for deploying ML workflows on Kubernetes. Kubeflow Pipelines allows users to build and deploy scalable ML workflows based on Docker containers[1]. When running Kubeflow on GKE, users can now define pipeline steps to run on preemptible nodes, reducing job costs[1].
Implementation Guidelines
1. Setup: Create a preemptible, GPU-enabled node pool in your GKE cluster[1].
2. Pipeline Definition: Modify pipeline steps to use preemptible nodes and specify retry attempts[1].
3. Idempotency: Ensure preemptible steps are either idempotent or can checkpoint work to resume after interruption[1].
Example: Reducing Model Training Costs
The post provides an example of modifying a Kubeflow pipeline that trains a Tensor2Tensor model on GitHub issue data:
1. Refactor the pipeline to separate data copying and model training steps[1].
2. Create reusable component specifications for Google Cloud Storage copy and TensorFlow training steps[1].
3. Annotate the training operation to run on a preemptible GPU-enabled node with multiple retry attempts[1].
Monitoring and Logging
Stackdriver Logging can be used to inspect logs for both current and terminated pipeline operations[1]. The Kubeflow Pipelines dashboard UI marks preempted and restarted training steps and links to the corresponding Stackdriver logs for detailed information[1].
Benefits and Use Cases
Preemptible VMs are well-suited for:
- Regularly scheduled training or tuning jobs with flexible completion times
- Large-scale hyperparameter tuning experiments
- Reducing costs in ML workflows
By using preemptible VMs and configuring node pools to autoscale, users can significantly lower costs without paying for idle instances[1].
Citations: