5.3 Fine Tuning Code Walkthrough Notes

Jupyter Notebook

Web-based interactive environment, most often used with Python.
Runs code one cell (block) at a time, so each step can be inspected before moving on.

PyTorch

Open-source framework widely used for machine learning.
Alternatives include TensorFlow and JAX.

Google Colab

Cloud environment for hosting Jupyter notebooks.
Provides runtime environment including GPU allocation.
Offers paid plans in addition to the free tier: pay-as-you-go compute units and monthly subscriptions (Pro, Pro+).

Setting up Colab Environment

Install the transformers library if it is not already present.
Troubleshooting: If the install or login step fails, install huggingface_hub first using pip install huggingface_hub.
Log in to Hugging Face using an access token from your account.
Install git-lfs (Git Large File Storage) so that large model files can be handled in version control.
Import transformers and the other necessary libraries.
Troubleshooting: If an import fails, reinstall the library, then run the import again.
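
The setup steps above can be checked from a notebook cell before installing anything. This is a minimal sketch, assuming the pip package names match the import names; the REQUIRED list and both helper functions are illustrative, not part of any library.

```python
import importlib.util

# Packages mentioned in these notes (assumption: pip name == import name).
REQUIRED = ["transformers", "huggingface_hub", "datasets", "torch", "accelerate"]


def missing_packages(names):
    """Return the packages from `names` that cannot be imported in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_command(names):
    """Build the pip command that would install the given packages."""
    return "pip install " + " ".join(names) if names else ""


missing = missing_packages(REQUIRED)
if missing:
    # In Colab or Jupyter, a leading "!" runs the line as a shell command.
    print("Run in a notebook cell: !" + pip_command(missing))
else:
    print("All required packages are already installed.")
```

After installing, restarting the runtime and re-running the import cell is often enough to clear a stale import error.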

Accessing Hugging Face

Create a free account on Hugging Face.
Obtain an access token from your Hugging Face account settings.
Use the token to log in so the notebook can authenticate with Hugging Face for data and model access.
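
One way to log in without pasting the token into the notebook is to read it from an environment variable. This is a sketch under assumptions: the variable name HF_TOKEN and the helper login_from_env are illustrative choices; login is the function provided by huggingface_hub.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to Hugging Face with a token stored in an environment variable.

    Returns False without contacting Hugging Face when the variable is unset.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the function can be defined before the library is installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

In an interactive notebook, huggingface_hub also offers notebook_login(), which shows a prompt for the token instead.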

Dataset Preparation

Load a dataset (e.g., WikiText) using load_dataset from the datasets library.
Inspect data using indexing (e.g., dataset['train'][1]).
Display random samples from the dataset using a function like show_random_element.
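
A possible shape for the loading and sampling steps is sketched below. The configuration name "wikitext-2-raw-v1" and the helper names are assumptions (the notes only say "Wikitext" and "a function like show_random_element"); the toy rows let the sampling helper run without a download.

```python
import random


def load_wikitext():
    """Download WikiText-2 (raw) from the Hugging Face Hub; needs network access."""
    from datasets import load_dataset  # lazy import: requires `pip install datasets`

    return load_dataset("wikitext", "wikitext-2-raw-v1")


def show_random_elements(split, num_examples=3, seed=None):
    """Return `num_examples` distinct random rows from an indexable split."""
    rng = random.Random(seed)
    picks = rng.sample(range(len(split)), k=min(num_examples, len(split)))
    return [split[i] for i in picks]


# Stand-in rows so the helper can be tried without downloading anything.
toy_split = [{"text": f"sample sentence {i}"} for i in range(10)]
for row in show_random_elements(toy_split, num_examples=3, seed=0):
    print(row["text"])
```

With the real dataset, the same helper would be called as show_random_elements(load_wikitext()["train"]).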

Causal Language Modeling (CLM)

Utilizes auto-regressive models like GPT.
Models predict the next token based on previous tokens.
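DistilGPT2 is a distilled version of GPT-2: a smaller model trained to mimic the larger one, which makes it more compact and faster to run.
Use the DistilGPT2 checkpoint from Hugging Face as the starting model.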
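
To make "predict the next token based on previous tokens" concrete, the sketch below lists the (context, target) pairs a causal model is trained on. Whole words stand in for tokens here, which is a simplification; next_token_pairs is an illustrative helper.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each target is the token that follows its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "down"]
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```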

Tokenization

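Use AutoTokenizer to load the tokenizer associated with the model checkpoint.
Apply the tokenizer to the whole dataset using the map method; this converts text into input IDs.
Speed up preprocessing by passing batched=True so that many rows are tokenized per call.
Examine the tokenized data to see the input IDs and attention masks.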
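
The tokenization step could look roughly like the sketch below. It assumes the text column is named "text" and that the checkpoint id is "distilgpt2"; tokenize_function and tokenize_dataset are illustrative names.

```python
def tokenize_function(examples, tokenizer):
    """Tokenize a batch of rows; examples["text"] is a list when batched=True."""
    return tokenizer(examples["text"])


def tokenize_dataset(raw_datasets, checkpoint="distilgpt2"):
    """Tokenize every split with the checkpoint's tokenizer.

    Needs the transformers library and network access to fetch the tokenizer.
    """
    from transformers import AutoTokenizer  # lazy import

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    return raw_datasets.map(
        lambda batch: tokenize_function(batch, tokenizer),
        batched=True,             # pass many rows per call instead of one
        remove_columns=["text"],  # keep only the tokenizer outputs
    )
```

After the call, each row holds input_ids and attention_mask instead of raw text.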

Attention Mask

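Indicates which tokens the model should attend to.
A value of 1 marks a real token; a value of 0 marks padding that the model should ignore.
A longer run of 1s in the mask means the example contains more real text and less padding.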
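
The mask matters most when sequences of different lengths are padded into one batch. The sketch below pads by hand to show the idea; pad_batch and the pad id 0 are assumptions for illustration (a real tokenizer produces the mask itself).

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 8, 2], [7]])
print(batch["input_ids"])       # [[5, 8, 2], [7, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```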

Data Chunking

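Split long token sequences into fixed-size chunks so that each training example fits in GPU memory.
Define a block size (the number of tokens per chunk).
Concatenate the tokenized texts into one continuous sequence, then cut it into blocks of that size.
Apply this grouping to the whole dataset using the map method.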
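
A common way to implement the chunking is a function that concatenates a batch and slices it into blocks. The sketch below follows that pattern; the function name group_texts and the default block size of 128 are assumptions, not values taken from the walkthrough.

```python
def group_texts(examples, block_size=128):
    """Concatenate a batch of tokenized rows and cut it into block_size chunks."""
    # Join each column's lists into one long list.
    concatenated = {key: sum(examples[key], []) for key in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [values[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, values in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs;
    # Hugging Face causal LM models shift them by one position internally.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1]],
}
chunks = group_texts(batch, block_size=3)
print(chunks["input_ids"])  # [[1, 2, 3], [4, 5, 6]]
```

On a tokenized dataset this would be applied as tokenized.map(group_texts, batched=True); a different block size can be passed through the fn_kwargs argument of map.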

Decoding

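Convert token IDs back to text with the tokenizer's decode method to verify the preprocessing.
Decoded chunks should read as ordinary text; if they do not, revisit the tokenization and chunking steps.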
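
The round trip from text to IDs and back can be illustrated with a toy vocabulary. This is only a sketch of the idea: the three-entry VOCAB and the whitespace splitting are assumptions and are far simpler than the subword tokenizer used by GPT-2 models.

```python
# Toy vocabulary; "<unk>" stands in for any word not listed.
VOCAB = {"hello": 0, "world": 1, "<unk>": 2}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def encode(text):
    """Map each whitespace-separated word to its ID."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.split()]


def decode(ids):
    """Map IDs back to words and join them into text."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("hello world")
print(ids)          # [0, 1]
print(decode(ids))  # hello world
```

With a real tokenizer, the equivalent check is tokenizer.decode(...) on the input_ids of one chunk.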

Model Training

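Import an auto-regressive model class (AutoModelForCausalLM) from transformers.
Install torch if it is not already available.
Import Accelerator from the accelerate library, which handles device placement so that training can run on the GPU.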
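
Loading the model and choosing a device might be sketched as follows. The helper names pick_device and load_causal_lm are illustrative, and the checkpoint id "distilgpt2" is an assumption; the download only happens when load_causal_lm is called.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed yet
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_causal_lm(checkpoint="distilgpt2"):
    """Download a pre-trained causal language model and move it to the chosen device."""
    from transformers import AutoModelForCausalLM  # lazy import

    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    return model.to(pick_device())


print("Training would run on:", pick_device())
```

If this prints "cpu" in Colab, the runtime type probably has no GPU attached.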

Troubleshooting Training Issues

Address potential import errors by re-importing transformers.
Resolve issues by restarting the environment and re-executing code blocks.
Ensure correct installation and importing of necessary libraries.

Fine-Tuning Process

Specify the block size and group the tokenized text into chunks of that size.
Decode a sample chunk back to text to check that the prepared data looks correct.
Train the model using the prepared and tokenized dataset.
Monitor GPU usage during training.
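
One way to wire the training step together is the Trainer class from transformers. The sketch below is an assumption about how the walkthrough might be set up: every value in TRAINING_CONFIG (including the output folder name) is illustrative, and build_trainer is a hypothetical helper.

```python
# Illustrative settings only; none of these values come from the walkthrough.
TRAINING_CONFIG = {
    "output_dir": "distilgpt2-finetuned",  # hypothetical folder for checkpoints
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
}


def build_trainer(model, train_dataset, eval_dataset, config=TRAINING_CONFIG):
    """Create a Trainer for a model and the prepared (tokenized, chunked) datasets."""
    from transformers import Trainer, TrainingArguments  # lazy import

    args = TrainingArguments(**config)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Calling trainer.train() on the result starts fine-tuning; in Colab, the runtime's resource panel shows GPU memory use while it runs.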

Debugging Tips

Debugging helps in understanding the code better.
Address code issues by understanding the error messages.

Training Parameters

Set up necessary components before training.
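Epochs: An epoch is one full pass through the training dataset; training usually runs for several.
Iterations: Number of batches processed during training; each iteration handles one batch.
Training loss: Measure of error on the training data during training.
Validation loss: Measure of error on the held-out validation dataset; if it rises while training loss keeps falling, the model is likely overfitting.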
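
The relationship between epochs and iterations is simple arithmetic, sketched below. The numbers (1000 examples, batch size 8, 3 epochs) are made up for illustration, and the helper assumes one iteration per batch with no gradient accumulation.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Return (iterations per epoch, total iterations)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch, steps_per_epoch * num_epochs


per_epoch, total = training_steps(num_examples=1000, batch_size=8, num_epochs=3)
print(per_epoch, total)  # 125 375
```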

Computing Unit Usage

Monitor available computing units in Google Colab.
Training decreases available computing units.

Hyperparameters

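Learning Rate: Controls the size of each weight update, and therefore how fast the model learns.
Adjust the learning rate for faster convergence: too low makes training slow, too high makes it unstable.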
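
The effect of the learning rate can be seen on a one-variable problem. The sketch below runs plain gradient descent on f(x) = x**2, which is a stand-in for a real loss; the three rates are arbitrary examples of "too small", "reasonable", and "too large".

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(x) = x**2 and return the final x; the gradient is 2*x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final |x| = {abs(gradient_descent(lr)):.4f}")
```

The smallest rate moves toward the minimum at 0 only slowly, the middle one gets close within 20 steps, and the largest one overshoots further on every step and diverges.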

Hugging Face Hub

Can host the fine-tuned model for inference.
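Evaluate the model after training to see how well it performs (e.g. 38% in the walkthrough). For causal language models the usual metric is perplexity, the exponential of the evaluation loss; lower is better.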
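
A common evaluation figure for causal language models is perplexity, which is the exponential of the average evaluation loss. The sketch below shows only that conversion; the perplexity helper is illustrative, and no real evaluation result is assumed.

```python
import math


def perplexity(eval_loss):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


print(round(perplexity(0.0), 2))  # 1.0, the value for a model that is never wrong
```

With a Trainer, the loss would come from trainer.evaluate()["eval_loss"], and trainer.push_to_hub() uploads the fine-tuned model to the Hugging Face Hub (this requires being logged in).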

Code Review

Import Hugging Face Hub and datasets.
Install necessary libraries.
Load dataset and prepare it for training.

Datasets

Load YUM dataset or other desired datasets.
Specify data splits (training, validation, testing).
Display random elements from the dataset.

Auto-regressive Models

Download pre-trained models (e.g., GPT-2) from Hugging Face.
Use distilled versions (DistilGPT2) for efficiency.
Tokenize data and group it together.

Key Steps

Tokenize the data by applying the tokenizer with the dataset's map method.
Represent each example as input IDs and an attention mask.
Train the model.

Loss Function

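Measures how far the predicted output is from the actual value; for language modeling this is typically the cross-entropy between the predicted next-token probabilities and the true next token.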
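
For next-token prediction the usual loss is cross-entropy, which is the negative log of the probability the model gave to the correct token. The sketch below computes it for a single prediction; the three-token probability list is a made-up example.

```python
import math


def cross_entropy(probabilities, target_index):
    """Loss for one prediction: -log of the probability assigned to the true token."""
    return -math.log(probabilities[target_index])


# The model gives the correct token (index 0) a probability of 0.7.
print(round(cross_entropy([0.7, 0.2, 0.1], 0), 3))  # 0.357
```

The loss shrinks toward 0 as the probability of the correct token approaches 1, and grows when the model puts its probability on the wrong tokens.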

Cloud Providers

Paperspace (now part of DigitalOcean) offers cloud GPUs.
Google Colab is a popular choice due to its free version.