5.3 Fine-Tuning Code Walkthrough Notes
Jupyter Notebook
- Web-based IDE for Python.
- Allows code execution block by block.
PyTorch
- Open-source framework widely used for machine learning.
- Alternatives include TensorFlow and JAX.
Google Colab
- Cloud environment for hosting Jupyter notebooks.
- Provides runtime environment including GPU allocation.
- Offers different subscription plans (monthly, pay-as-you-go, Pro, Pro+).
Setting up Colab Environment
- Install the transformers library if not already present.
- Troubleshooting: Install huggingface_hub first using pip install huggingface_hub.
- Log in to Hugging Face using an access token from your account.
- Install git-lfs for handling large files in version control.
- Import transformers and other necessary libraries.
- Troubleshooting: Resolve import errors by reinstalling or re-importing libraries.
Accessing Hugging Face
- Create a free account on Hugging Face.
- Obtain an access token from your Hugging Face account settings.
- Use the token to log in and communicate with Hugging Face for data access.
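A minimal sketch of the token login, assuming the access token is stored in an environment variable. The variable name HF_TOKEN and the helper name login_from_env are assumptions for this example, not names from the walkthrough. In Colab, the interactive notebook_login() from huggingface_hub is a common alternative.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to Hugging Face with a token read from an environment variable.

    Returns True if a login was attempted, False if no token was found.
    `var_name` is an assumption for this sketch, not a name from the notes.
    """
    token = os.environ.get(var_name)
    if not token:
        print(f"No token found in ${var_name}; skipping login.")
        return False

    # Imported lazily so the helper can be defined without the library installed.
    from huggingface_hub import login

    login(token=token)  # stores the token for later calls to the Hub
    return True
```

Calling login_from_env() once near the top of the notebook is enough; later dataset and model downloads reuse the stored token.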
Dataset Preparation
- Load a dataset (e.g., Wikitext) using load_dataset.
- Inspect data using indexing (e.g., dataset['train'][1]).
- Display random samples from the dataset using a function like show_random_element.
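The inspection steps above can be sketched without downloading anything. The helper name pick_random_elements and the toy data are assumptions made so the example runs offline; a split returned by load_dataset supports the same len() and integer indexing.

```python
import random


def pick_random_elements(dataset, num_examples=3, seed=None):
    """Return `num_examples` distinct random rows from an indexable dataset."""
    if num_examples > len(dataset):
        raise ValueError("Cannot pick more elements than the dataset contains.")
    rng = random.Random(seed)
    indices = rng.sample(range(len(dataset)), num_examples)
    return [dataset[i] for i in indices]


# Tiny in-memory stand-in (hypothetical data) for dataset['train'].
toy_train = [{"text": f"sample sentence {i}"} for i in range(10)]

print(toy_train[1])  # same indexing pattern as dataset['train'][1]
for row in pick_random_elements(toy_train, num_examples=3, seed=0):
    print(row["text"])
```

With the datasets library installed, something like load_dataset("wikitext", "wikitext-2-raw-v1") would supply the real splits; the configuration name is an assumption, since the notes only say "Wikitext".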
Causal Language Modeling (CLM)
- Utilizes auto-regressive models like GPT.
- Models predict the next token based on previous tokens.
- DistilGPT2 is a distilled (smaller, faster) version of GPT-2.
- Use a checkpoint model (DistilGPT2) from Hugging Face.
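To make "predict the next token based on previous tokens" concrete, the sketch below lists the (context, target) pairs a causal language model is trained on. Words stand in for tokens here, and the function name is hypothetical.

```python
def next_token_pairs(token_ids):
    """Build (context, target) pairs for next-token prediction.

    For each position i >= 1, the context is everything before i
    and the target is the token at i.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


tokens = ["The", "cat", "sat", "down"]
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

In practice the model scores all positions of a sequence in one pass, with a causal mask so that each position only sees the tokens before it.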
Tokenization
- Use auto-tokenizer associated with the model.
- Convert text to input IDs by applying the tokenizer with the map method.
- Speed up preprocessing using batched=True.
- Examine tokenized data to see input IDs and attention masks.
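A sketch of the tokenization step, assuming the dataset has a column named "text" and the checkpoint is distilgpt2 (both assumptions). Note that the keyword accepted by map is batched. The helper names are hypothetical, and the transformers import is deferred so the functions can be defined without the library installed.

```python
def tokenize_function(examples, tokenizer):
    """Tokenize a batch; with batched=True, examples["text"] is a list of strings."""
    return tokenizer(examples["text"])


def tokenize_dataset(raw_datasets, checkpoint="distilgpt2"):
    """Apply the checkpoint's tokenizer to every split of a datasets object."""
    # Imported lazily so this sketch can be defined without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    return raw_datasets.map(
        lambda batch: tokenize_function(batch, tokenizer),
        batched=True,             # pass many rows to the tokenizer per call
        remove_columns=["text"],  # assumed column name; keeps only tokenizer outputs
    )
```

Indexing the result (for example tokenized["train"][1]) should then show the input_ids and attention_mask fields mentioned above.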
Attention Mask
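- Marks which positions the model should attend to.
- A value of 1 means the token is real input; a value of 0 marks padding to be ignored.
- Without padding the mask is all ones and has one entry per token, so longer text produces a longer mask.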
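A small sketch of how an attention mask lines up with padded input IDs. The function name and the padding ID of 0 are assumptions for illustration; real tokenizers define their own padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad sequences to a common length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 12, 13], [21]])
print(batch["input_ids"])       # [[11, 12, 13], [21, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```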
Data Chunking
- Split data into smaller chunks to fit GPU memory.
- Define a block size.
- Preprocess data to group text into cohesive sequences.
- Apply the chunking to the whole dataset using the map method.
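The chunking step can be sketched as a function that concatenates the tokenized texts in a batch and cuts the result into blocks of block_size tokens. This follows a common pattern for causal language modeling and is not necessarily the exact code from the walkthrough; the default block size of 128 is an assumption.

```python
from itertools import chain


def group_texts(examples, block_size=128):
    """Concatenate tokenized texts and split them into fixed-size blocks."""
    # Flatten each field (input_ids, attention_mask, ...) into one long list.
    concatenated = {
        key: list(chain.from_iterable(values)) for key, values in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # For causal LM training the labels are a copy of the inputs;
    # the model shifts them by one position internally.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
chunks = group_texts(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

On a tokenized dataset this would typically be applied as tokenized.map(group_texts, batched=True), where tokenized is a hypothetical variable holding the output of the tokenization step.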
Decoding
- Convert token IDs back to text for verification.
- Use the decoded text to check that tokenization and chunking preserved the data.
Model Training
- Import an auto-regressive model class (e.g., AutoModelForCausalLM) from transformers.
- Install torch if needed.
- Import Accelerator (from the accelerate library) for faster training.
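A hedged sketch of the training setup using the Trainer API from transformers. The hyperparameter values, the output directory, and the split names "train" and "validation" are illustrative assumptions, not values taken from the walkthrough. The heavy imports are deferred so the functions can be defined without the libraries installed.

```python
def build_training_kwargs(output_dir="distilgpt2-finetuned"):
    """Return illustrative keyword arguments for TrainingArguments."""
    return {
        "output_dir": output_dir,   # where checkpoints are written
        "learning_rate": 2e-5,      # assumed value for this sketch
        "weight_decay": 0.01,       # assumed value for this sketch
        "num_train_epochs": 3,      # assumed value for this sketch
    }


def fine_tune(lm_datasets, checkpoint="distilgpt2"):
    """Fine-tune a causal language model on chunked, tokenized datasets."""
    # Imported lazily: these need transformers and torch to be installed.
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    args = TrainingArguments(**build_training_kwargs())
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=lm_datasets["train"],
        eval_dataset=lm_datasets["validation"],
    )
    trainer.train()
    return trainer.evaluate()  # dict of metrics, including "eval_loss"
```

fine_tune expects datasets that already contain input_ids, attention_mask, and labels of equal length per row, as produced by the tokenization and chunking steps.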
Troubleshooting Training Issues
- Address potential import errors by re-importing transformers.
- Resolve issues by restarting the environment and re-executing code blocks.
- Ensure correct installation and importing of necessary libraries.
Fine-Tuning Process
- Specify the block size and group text into smaller chunks.
- Decode a sample back to text to check the preprocessed data.
- Train the model using the prepared and tokenized dataset.
- Monitor GPU usage during training.
Debugging Tips
- Debugging helps in understanding the code better.
- Address code issues by understanding the error messages.
Training Parameters
- Set up necessary components before training.
- Epochs: Multiple passes through the training dataset.
- Iterations: Number of batches processed during training.
- Training loss: Measure of error during training.
- Validation loss: Measure of error on validation dataset.
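The relationship between epochs and iterations is simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch is kept; the numbers are made up for illustration.

```python
import math


def training_iterations(num_examples, batch_size, num_epochs):
    """Return (iterations per epoch, total iterations) for a training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_iterations(
    num_examples=1000, batch_size=8, num_epochs=3
)
print(steps_per_epoch, total_steps)  # 125 375
```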
Computing Unit Usage
- Monitor available computing units in Google Colab.
- Training decreases available computing units.
Hyperparameters
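- Learning Rate: Controls the size of each weight update, and therefore how fast the model learns.
- Adjust the learning rate for faster convergence; a value that is too high can make training unstable.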
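A toy illustration of the learning rate's effect, using plain gradient descent on f(w) = w**2 rather than a real model. The function and the three rates are chosen only for illustration: a small rate converges slowly, a moderate rate converges quickly, and a rate that is too large diverges.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w               # derivative of w**2
        w -= learning_rate * grad  # step size is scaled by the learning rate
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4f}")
```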
Hugging Face Hub
- Can host the fine-tuned model for inference.
- Evaluate the model after training to see how well it performs (e.g. 38%).
Code Review
- Import Hugging Face Hub and datasets.
- Install necessary libraries.
- Load dataset and prepare it for training.
Datasets
- Load YUM dataset or other desired datasets.
- Specify data splits (training, validation, testing).
- Display random elements from the dataset.
Auto-regressive Models
- Download pre-trained models (e.g., GPT-2) from Hugging Face.
- Use distilled versions (DistilGPT2) for efficiency.
- Tokenize data and group it together.
Key Steps
- Tokenize data using the map method.
- Represent data as input IDs and attention masks.
- Train the model.
Loss Function
- Measures how far the predicted output is from the actual value; training aims to minimize it.
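For next-token prediction the usual loss is cross-entropy: the negative log of the probability the model assigned to the correct token. The sketch below computes it by hand on made-up probabilities, together with perplexity, which is the exponential of the mean loss. The function names and numbers are illustrative assumptions.

```python
import math


def cross_entropy(predicted_probs, target_index):
    """Loss for one prediction: -log of the probability given to the true token."""
    return -math.log(predicted_probs[target_index])


def mean_loss_and_perplexity(batch_probs, targets):
    """Average cross-entropy over a batch, and the corresponding perplexity."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, targets)]
    mean_loss = sum(losses) / len(losses)
    return mean_loss, math.exp(mean_loss)


targets = [0, 1]  # index of the correct token at each position
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

for name, probs in (("confident", confident), ("uniform", uniform)):
    loss, ppl = mean_loss_and_perplexity(probs, targets)
    print(f"{name}: loss={loss:.3f}, perplexity={ppl:.3f}")
```

A model that spreads probability evenly over k tokens has perplexity k, so lower values mean the model is less "surprised" by the correct tokens.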
Cloud Providers
- Paperspace (now part of DigitalOcean) offers cloud GPUs.
- Google Colab is a popular choice due to its free version.