5.3 Fine-Tuning Code Walkthrough Notes

Jupyter Notebook

  • Web-based interactive environment for writing and running code, most often Python.
  • Runs code cell by cell, so each block can be executed and inspected on its own.

PyTorch

  • Open-source framework widely used for machine learning.
  • Alternatives include TensorFlow and JAX.

Google Colab

  • Cloud environment for hosting Jupyter notebooks.
  • Provides a runtime environment, including GPU allocation.
  • Offers a free tier and paid options (pay-as-you-go compute units, and monthly Pro and Pro+ subscriptions).

Setting up Colab Environment

  • Install the transformers library if it is not already present.
  • Troubleshooting: install huggingface_hub first with pip install huggingface_hub.
  • Log in to Hugging Face using an access token from your account.
  • Install git-lfs, which handles large files (such as model weights) in version control.
  • Import transformers and the other necessary libraries.
  • Troubleshooting: resolve import errors by reinstalling the library and re-running the import.
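
Before installing anything, it can help to check which libraries the runtime already has. The following is a minimal sketch using only the standard library; the package list in REQUIRED_PACKAGES is an assumption for illustration and should be adjusted to what the notebook actually imports.

```python
import importlib.util

# Hypothetical list for illustration; adjust to your notebook's imports.
REQUIRED_PACKAGES = ["transformers", "datasets", "huggingface_hub", "torch"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    # In a Colab cell these could then be installed with: !pip install <name>
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Checking first avoids reinstalling libraries that Colab already ships, which is a common source of version conflicts.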

Accessing Hugging Face

  • Create a free account on Hugging Face.
  • Obtain an access token from your Hugging Face account settings.
  • Use the token to log in so the notebook can download models and datasets from the Hugging Face Hub and upload to it.
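
One way to avoid pasting the token into a notebook cell is to read it from an environment variable. This is a sketch under assumptions: the variable name HF_TOKEN and both helper names are choices made here, not part of the walkthrough. The login function is provided by the huggingface_hub library and is imported lazily so the first helper works without it.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the access token from an environment variable; return None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_to_hub(token):
    """Log in to the Hugging Face Hub with the given token (requires huggingface_hub)."""
    from huggingface_hub import login  # lazy import: only needed when logging in

    login(token=token)


token = get_hf_token()
if token is None:
    print("No token found; set HF_TOKEN before calling login_to_hub().")
```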

Dataset Preparation

  • Load a dataset (e.g., Wikitext) with load_dataset from the datasets library.
  • Inspect data using indexing (e.g., dataset['train'][1]).
  • Display random samples from the dataset using a function like show_random_element.
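
A random-sampling helper can be sketched as below. The function name and signature are assumptions modeled on the description above; it works on anything that supports len() and integer indexing, so a small in-memory list stands in here for a dataset split returned by load_dataset.

```python
import random


def show_random_elements(dataset, num_examples=3, seed=None):
    """Return num_examples distinct, randomly chosen records from dataset."""
    if num_examples > len(dataset):
        raise ValueError("Cannot pick more examples than the dataset contains.")
    rng = random.Random(seed)
    picks = rng.sample(range(len(dataset)), num_examples)
    return [dataset[i] for i in picks]


# Toy stand-in for a dataset split; each record is a dict with a "text" field.
toy_dataset = [{"text": f"Sample sentence number {i}."} for i in range(10)]

for record in show_random_elements(toy_dataset, num_examples=3, seed=0):
    print(record["text"])
```

Looking at random records rather than only the first few helps catch empty lines, headings, and other noise before tokenization.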

Causal Language Modeling (CLM)

  • Uses auto-regressive models such as GPT.
  • The model predicts the next token from the tokens that precede it.
  • DistilGPT2 is a distilled (compressed) version of GPT-2, which makes it smaller and faster to fine-tune.
  • Use the DistilGPT2 checkpoint (distilgpt2) from Hugging Face.
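
The next-token objective can be illustrated without any model. This sketch, using a hypothetical helper name, lists the (context, target) pairs that a causal language model is trained on for a single short sequence.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each target is the token that follows its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["The", "cat", "sat", "down"]
for context, target in next_token_pairs(tokens):
    print(f"{' '.join(context):<12} -> {target}")
```

Every position in the sequence yields one training signal, which is why a single block of text provides many prediction targets.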

Tokenization

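  • Use AutoTokenizer to load the tokenizer that matches the model checkpoint.
  • Apply the tokenizer to the whole dataset with the map method, which converts text into input IDs.
  • Speed up preprocessing with batched=True, so the tokenizer receives many examples per call.
  • Examine the tokenized data to see the input IDs and attention masks.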
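
The shape of a batched tokenization function can be sketched offline. ToyTokenizer below is a stand-in written for this example, not the real tokenizer: it splits on whitespace so the code runs without downloads. In the notebook, the tokenizer would instead come from AutoTokenizer.from_pretrained("distilgpt2"), and the function would be passed to the dataset's map method with batched=True.

```python
class ToyTokenizer:
    """Stand-in for a real tokenizer: splits on whitespace and assigns integer IDs."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, texts):
        input_ids = []
        for text in texts:
            # New words get the next free ID; known words reuse their ID.
            ids = [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]
            input_ids.append(ids)
        attention_mask = [[1] * len(ids) for ids in input_ids]
        return {"input_ids": input_ids, "attention_mask": attention_mask}

    def decode(self, ids):
        """Map IDs back to text, useful for checking the encoding."""
        id_to_word = {index: word for word, index in self.vocab.items()}
        return " ".join(id_to_word[i] for i in ids)


tokenizer = ToyTokenizer()


def tokenize_function(examples):
    # With batched=True, `examples` is a dict of lists covering many records at once.
    return tokenizer(examples["text"])


batch = {"text": ["the cat sat", "the dog"]}
encoded = tokenize_function(batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
print(tokenizer.decode(encoded["input_ids"][0]))
```

Decoding the first example back to text is a quick check that the IDs really correspond to the original input.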

Attention Mask

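  • Tells the model which tokens to attend to and which to ignore.
  • A value of 1 marks a real token; a value of 0 marks padding added to equalize sequence lengths.
  • The mask has the same length as the input IDs, so a sequence with more 1s contains more real text.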
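
The relationship between padding and the attention mask can be shown with a small sketch. The helper name pad_batch and the padding ID of 0 are assumptions for illustration; real tokenizers define their own padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad sequences to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 for real tokens, 0 for the padding positions appended above.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 8, 2], [7]])
print(batch["input_ids"])
print(batch["attention_mask"])
```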

Data Chunking

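  • Split the tokenized data into smaller chunks so that each training example fits in GPU memory.
  • Define a block size (the number of tokens per chunk).
  • Concatenate the tokenized texts and regroup them into contiguous sequences of block-size tokens.
  • Apply this grouping function to the whole dataset with the map method.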
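
The chunking step can be sketched as a function over a batch of tokenized examples. This is a minimal version under assumptions: the name group_texts, the tiny BLOCK_SIZE, and the choice to drop the leftover tokens at the end are all illustrative. Labels are copies of the input IDs because Hugging Face causal language models shift the labels internally when computing the loss.

```python
BLOCK_SIZE = 4  # tiny value for the demo; real runs use far larger blocks


def group_texts(examples, block_size=BLOCK_SIZE):
    """Concatenate tokenized examples and split them into fixed-size blocks."""
    # Flatten each column (input_ids, attention_mask, ...) into one long list.
    concatenated = {
        key: [token for seq in sequences for token in seq]
        for key, sequences in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so that every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [values[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, values in concatenated.items()
    }
    # For causal language modeling the labels are the inputs themselves.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


tokenized = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
chunks = group_texts(tokenized)
print(chunks["input_ids"])
```

Ten tokens with a block size of four give two full blocks; the last two tokens are discarded.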

Decoding

  • Convert token IDs back to text for verification.
  • Compare the decoded text with the original to confirm that tokenization and chunking preserved the data.

Model Training

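  • Import AutoModelForCausalLM from transformers and load the checkpoint.
  • Install torch if needed (Colab runtimes usually include it).
  • Install and import the accelerate library (Accelerator), which handles device placement and can speed up training on GPUs.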
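
Loading the checkpoint might look like the sketch below. The helper names are hypothetical, and the transformers import is placed inside the function so that the block can be read and defined without the library installed; calling load_causal_lm downloads the model and requires network access.

```python
MODEL_CHECKPOINT = "distilgpt2"


def load_causal_lm(checkpoint=MODEL_CHECKPOINT):
    """Load the tokenizer and causal language model for a checkpoint (requires transformers)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    return tokenizer, model


def count_parameters(model):
    """Total number of parameters, a quick check that the expected model was loaded."""
    return sum(param.numel() for param in model.parameters())


# Example usage in a notebook cell with transformers and torch installed:
#   tokenizer, model = load_causal_lm()
#   print(count_parameters(model))
```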

Troubleshooting Training Issues

  • Address potential import errors by re-importing transformers.
  • Resolve issues by restarting the environment and re-executing code blocks.
  • Ensure correct installation and importing of necessary libraries.

Fine-Tuning Process

  • Specify the block size and group the tokenized text into chunks of that length.
  • Decode a few chunks back to text to verify that the preprocessed data looks clean.
  • Train the model using the prepared and tokenized dataset.
  • Monitor GPU usage during training.

Debugging Tips

  • Debugging helps in understanding the code better.
  • Address code issues by understanding the error messages.

Training Parameters

  • Set up necessary components before training.
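  • Epochs: Number of full passes through the training dataset; training usually runs for more than one.
  • Iterations (steps): Number of batches processed during training, typically one parameter update per batch.
  • Training loss: Measure of error on the data the model is trained on.
  • Validation loss: Measure of error on held-out validation data; if it rises while training loss falls, the model is likely overfitting.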
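
The relationship between epochs and iterations is simple arithmetic. This sketch assumes the final partial batch is kept and that there is no gradient accumulation; the dataset size and batch size are made-up numbers.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for a training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_steps(num_examples=1000, batch_size=8, num_epochs=3)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

The total step count is what the progress bar counts during training, so it is a useful estimate of how long a run will take.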

Computing Unit Usage

  • Monitor available computing units in Google Colab.
  • Training decreases available computing units.

Hyperparameters

  • Learning Rate: Controls the size of each parameter update during training.
  • Tune the learning rate for faster convergence: too low and training is slow, too high and training can become unstable.
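
The effect of the learning rate can be seen on a toy problem. This sketch minimizes f(x) = x² with plain gradient descent; it is an illustration of the idea only, not the optimizer used in the walkthrough, and the three rates are arbitrary choices.

```python
def gradient_descent(learning_rate, steps=10, start=1.0):
    """Minimize f(x) = x**2 from `start` and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x  # derivative of x**2
        x -= learning_rate * gradient
    return x


for rate in (0.01, 0.1, 1.1):
    print(f"learning rate {rate}: final x = {gradient_descent(rate):.4f}")
```

The smallest rate moves toward the minimum at 0 only slowly, the middle rate gets much closer in the same number of steps, and the largest rate overshoots so badly that x grows instead of shrinking.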

Hugging Face Hub

  • Can host the fine-tuned model for inference.
  • Evaluate the model after training to see how well it performs (e.g., a score of 38; for language models this is usually reported as perplexity rather than a percentage).
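
Language models are commonly evaluated with perplexity, which is the exponential of the average cross-entropy loss on the evaluation set. The sketch below shows the conversion; the loss value is a made-up number, not a result from the walkthrough.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Convert an average cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_cross_entropy_loss)


eval_loss = 3.64  # hypothetical evaluation loss
print(f"Perplexity: {perplexity(eval_loss):.2f}")
```

Lower perplexity is better; a perplexity of 1 would mean the model predicts every next token with certainty.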

Code Review

  • Install the necessary libraries.
  • Import huggingface_hub and datasets.
  • Load the dataset and prepare it for training.

Datasets

  • Load YUM dataset or other desired datasets.
  • Specify data splits (training, validation, testing).
  • Display random elements from the dataset.

Auto-regressive Models

  • Download pre-trained models (e.g., GPT-2) from Hugging Face.
  • Use distilled versions (DistilGPT2) for efficiency.
  • Tokenize data and group it together.

Key Steps

  • Tokenize the data by applying the tokenizer with the dataset's map method.
  • Encode the data as input IDs and attention masks.
  • Train the model.

Loss Function

  • Measures how far the predicted output is from the actual value; for causal language modeling this is typically the cross-entropy of the predicted next token.
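
For next-token prediction, a common choice of loss is cross-entropy between the model's predicted distribution and the true next token. The following is a minimal pure-Python sketch for a single prediction; the logits are made-up scores over a three-token vocabulary.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[target_index])


logits = [5.0, 0.0, 0.0]  # the model strongly favors token 0
print(f"loss if token 0 is correct: {cross_entropy(logits, 0):.4f}")
print(f"loss if token 1 is correct: {cross_entropy(logits, 1):.4f}")
```

The loss is small when the model puts high probability on the correct token and large when it does not, which is the signal that training minimizes.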

Cloud Providers

  • Paperspace (now part of DigitalOcean) offers cloud GPUs.
  • Google Colab is a popular choice due to its free version.