Preemptible VMs for Cost Reduction
Preemptible VMs are Compute Engine instances that last a maximum of 24 hours and provide no availability guarantees, but are priced significantly lower than standard VMs[1]. They can be used with Google Kubernetes Engine (GKE) to set up clusters or node pools with GPUs attached, making them well suited to ML workloads that can tolerate interruption and have flexible completion times[1].
Kubeflow Pipelines and Preemptible VMs
Kubeflow is an open-source project for deploying ML workflows on Kubernetes. Kubeflow Pipelines allows users to build and deploy scalable ML workflows based on Docker containers[1]. When running Kubeflow on GKE, users can now define pipeline steps to run on preemptible nodes, reducing job costs[1].
Implementation Guidelines
1. Setup: Create a preemptible, GPU-enabled node pool in your GKE cluster[1].
2. Pipeline Definition: Modify pipeline steps to use preemptible nodes and specify retry attempts[1].
3. Idempotency: Ensure preemptible steps are either idempotent or can checkpoint work to resume after interruption[1].
Example: Reducing Model Training Costs
The post provides an example of modifying a Kubeflow pipeline that trains a Tensor2Tensor model on GitHub issue data:
1. Refactor the pipeline to separate data copying and model training steps[1].
2. Create reusable component specifications for Google Cloud Storage copy and TensorFlow training steps[1].
3. Annotate the training operation to run on a preemptible GPU-enabled node with multiple retry attempts[1].
Monitoring and Logging
Stackdriver Logging can be used to inspect logs for both current and terminated pipeline operations[1]. The Kubeflow Pipelines dashboard UI marks preempted and restarted training steps and links to the corresponding Stackdriver logs for detailed information[1].
Benefits and Use Cases
Preemptible VMs are well-suited for:
- Regularly scheduled training or tuning jobs with flexible completion times
- Large-scale hyperparameter tuning experiments
- Reducing costs in ML workflows
By using preemptible VMs and configuring node pools to autoscale, users can significantly lower costs without paying for idle instances[1].
Citations: