Fine-Tuning Llama 3: Mastering Customization for AI Projects

Wednesday, July 17, 2024 by adebisi_oluwatomiwa878

Welcome to this tutorial on fine-tuning the Llama 3 model for various tasks! My name is Tommy, and I'll be guiding you through this tutorial designed to equip you with the skills needed to fine-tune a state-of-the-art generative model using real-world datasets. By the end of this tutorial, you'll be ready to apply your knowledge in AI hackathons and other exciting projects.

Objectives 📋

In this tutorial, we'll cover:

  • The process of fine-tuning Llama 3 for various tasks using customizable datasets.
  • Using the Unsloth implementation of Llama 3 for its efficiency.
  • Leveraging Hugging Face's tools for model handling and dataset management.
  • Adapting the fine-tuning process to your specific needs, allowing you to fine-tune Llama 3 for any task.

Prerequisites 🛠️

  • Basic understanding of transformers
  • Familiarity with Python programming
  • Access to Google Colab
  • Basic knowledge of fine-tuning models

Setting Up the Environment 🖥️

Google Colab ⚙️

To get started, open Google Colab and create a new notebook. Enable GPU support for faster training by navigating to Edit > Notebook settings and selecting T4 GPU as the hardware accelerator.
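
To confirm the GPU is active, you can run a quick check in a cell (a minimal sketch; Colab ships with PyTorch preinstalled):

import torch

# Should print True and the GPU name (e.g. "Tesla T4") once the GPU runtime is enabled
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))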

Installing Dependencies 📦

In your Colab notebook, run the following commands to install the necessary libraries:

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
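
Optionally, you can confirm that the pinned versions were picked up (a small sketch using only the standard library; the names below are the installed distribution names):

from importlib.metadata import version

# xformers should report a release below 0.0.27 and trl a release below 0.9.0
for pkg in ["unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes"]:
    print(pkg, version(pkg))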

Loading the Pre-trained Model 📚

We'll use the Unsloth implementation of Llama 3, which is optimized for faster training and inference.

Note: If you're using a gated model from Hugging Face, you will need to pass your Hugging Face access token via the token argument of FastLanguageModel.from_pretrained.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Choose any value up to 8192, Llama 3's context length
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="YOUR_HUGGINGFACE_ACCESS_TOKEN"  # Add this line if using a gated model
)
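
If you prefer not to hard-code the token in the notebook, you can authenticate once per session with the huggingface_hub login helper instead (a sketch; after logging in, gated checkpoints can usually be downloaded without passing token= explicitly):

from huggingface_hub import notebook_login

# Opens a token prompt in Colab; paste your Hugging Face access token there
notebook_login()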

Preparing the Dataset 📊

First, create a dataset.json file with the following content and upload it to Google Colab; we'll use it to fine-tune the model for sentiment analysis:

[
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "I love the new features of this product!",
    "output": "positive"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "The weather is okay, nothing special.",
    "output": "neutral"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "I'm really disappointed with the service.",
    "output": "negative"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "The movie was fantastic and thrilling!",
    "output": "positive"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "I don't mind waiting, it's not a big deal.",
    "output": "neutral"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "The food was terrible and tasteless.",
    "output": "negative"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "Had a great time at the park today!",
    "output": "positive"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "The book was quite boring and slow.",
    "output": "negative"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "It's just another day, nothing much happened.",
    "output": "neutral"
  },
  {
    "instruction": "Classify the sentiment of the following text.",
    "input": "The customer support was very helpful.",
    "output": "positive"
  }
]
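
If you'd rather create the file from the notebook than upload it manually, one option is to write it with the standard library (a minimal sketch; the samples list below stands in for the full JSON above):

import json

samples = [
    {
        "instruction": "Classify the sentiment of the following text.",
        "input": "I love the new features of this product!",
        "output": "positive",
    },
    # ... extend with the remaining examples from the JSON above
]

# Writes dataset.json next to the notebook so load_dataset can find it
with open("dataset.json", "w") as f:
    json.dump(samples, f, indent=2)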

Next, define the prompt template to be used with the dataset for fine-tuning, then load the dataset from the uploaded dataset.json file:

from datasets import load_dataset

fine_tuned_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    # With batched=True, each field is a list covering the whole batch
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = fine_tuned_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Load dataset from the local JSON file
dataset = load_dataset('json', data_files='dataset.json', split='train')
dataset = dataset.map(formatting_prompts_func, batched = True)
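
It's worth printing one formatted example to confirm the prompt template and EOS token were applied as expected:

# Quick sanity check: inspect the first formatted training example
print(dataset[0]["text"])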

Fine-Tuning the Model 🔧

We'll use LoRA (Low-Rank Adaptation) to fine-tune the model efficiently. LoRA helps in adapting large models by inserting trainable low-rank matrices into each layer of the Transformer architecture.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for 30% less VRAM usage
)

Parameters Explanation 📝

  • r: Rank of the low-rank approximation, set to 16 for a good balance between performance and memory usage.
  • target_modules: Specifies which modules LoRA is applied to, focusing on the most critical parts of the model.
  • lora_alpha: Scaling factor for LoRA weights, set to 16 for stable training.
  • lora_dropout: Dropout rate applied to LoRA layers, set to 0 for no dropout.
  • bias: Indicates how biases are treated, set to "none" meaning biases are not trained.
  • use_gradient_checkpointing: Reduces memory usage by recomputing intermediate activations during the backward pass instead of storing them all; the "unsloth" option uses Unsloth's optimized implementation. A quick check of the effect of these settings is shown below.
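
A quick way to see how few weights LoRA actually trains is to print the trainable-parameter count. This assumes the object returned by get_peft_model exposes the standard PEFT helper, which it normally does:

# Typically reports only a small fraction of the 8B parameters as trainable
model.print_trainable_parameters()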

Training 🏋️

We will use the SFTTrainer from Hugging Face's TRL library to train the model.

from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
    output_dir="./results",
    # num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    max_steps = 60,
    learning_rate = 2e-4,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length = max_seq_length,
)

trainstats = trainer.train()
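
trainer.train() returns a summary object; printing its metrics is an easy way to record the final loss and runtime (a small sketch, assuming the standard transformers TrainOutput fields):

# Summary of the run, e.g. train_loss, train_runtime, train_samples_per_second
print(trainstats.metrics)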

TrainingArguments Parameters used:

  • output_dir: The directory where the trained model and checkpoints will be saved. It is essential for resuming training and sharing the model.
  • per_device_train_batch_size: The batch size to use for training on each device. This affects the memory usage and training speed.
  • save_steps: The number of steps between each save of the model. This helps in resuming training from the last checkpoint in case of interruptions.
  • save_total_limit: The maximum number of checkpoints to keep. Older checkpoints will be deleted, which helps in managing disk space.
  • gradient_accumulation_steps: The number of steps to accumulate gradients before performing a backward pass. This is useful for large models that cannot fit into the GPU memory with a larger batch size.
  • warmup_steps: The number of steps to perform a learning rate warmup. This helps in stabilizing the training process.
  • max_steps: The total number of training steps. Training will stop after reaching this number.
  • learning_rate: The learning rate to use for training. This controls the size of the updates to the model's weights.
  • fp16: Whether to use 16-bit (half-precision) floating-point numbers during training, which can reduce memory usage and speed up training on GPUs that support it.
  • bf16: Whether to use bfloat16 (brain floating point) precision, which is supported on newer GPUs (Ampere and later) and on TPUs; is_bfloat16_supported() picks between fp16 and bf16 automatically. A quick check of these settings appears after this list.
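
As a sanity check on the settings above (an optional sketch; nothing here is required for training), you can compute the effective batch size and see which precision flag will be active on your GPU:

from unsloth import is_bfloat16_supported

# Effective batch size = per_device_train_batch_size x gradient_accumulation_steps
print("Effective batch size:", 4 * 4)  # 16

# False on a T4 (fp16 is used); True on Ampere or newer GPUs (bf16 is used)
print("bf16 active:", is_bfloat16_supported())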

SFTTrainer Parameters used:

  • model: The model to be trained.
  • args: The TrainingArguments that define the training configuration.
  • train_dataset: The dataset to use for training.
  • tokenizer: The tokenizer used to process the data. It is essential for converting text to input tensors.
  • dataset_text_field: The name of the field in the dataset that contains the text to be used for training.
  • max_seq_length: The maximum length of the sequences to be fed into the model. Sequences longer than this will be truncated.

Using the Fine-Tuned Model 🧠

Now that the model is trained, let's try it out on a sample input to test the sentiment analysis task. Inference is the process of using a trained model to make predictions on new data.

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    fine_tuned_prompt.format(
        "Classify the sentiment of the following text.", # instruction
        "I dislike playing football under the rain", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
outputs = tokenizer.decode(outputs[0])
print(outputs)
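
The decoded output contains the full prompt followed by the generated completion. A small helper (an illustrative function, not part of the Unsloth API) can pull out just the text after the ### Response: marker:

def extract_response(decoded_text):
    # Keep only the text after the Response header and drop the EOS token
    response = decoded_text.split("### Response:")[-1]
    return response.replace(tokenizer.eos_token, "").strip()

print(extract_response(outputs))  # e.g. "negative"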

Saving and Sharing the Model 💾

There are two ways to save your fine-tuned model:

Saving the Model Locally

model.save_pretrained("path/to/save")
tokenizer.save_pretrained("path/to/save")
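
Because we trained with LoRA, save_pretrained stores only the small adapter weights, not the full 8B model. If you need a standalone model with the adapters merged into the base weights, recent Unsloth versions provide a merged-save helper; a sketch, assuming it is available in your installed version (the output path is just an example):

# Writes a 16-bit model with the LoRA adapters merged into the base weights
model.save_pretrained_merged("path/to/save-merged", tokenizer, save_method="merged_16bit")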

Saving the Model to Hugging Face Hub (Online)

model.push_to_hub("your_username/your_model_name", token = "YOUR_HUGGINGFACE_ACCESS_TOKEN")
tokenizer.push_to_hub("your_username/your_model_name", token = "YOUR_HUGGINGFACE_ACCESS_TOKEN")
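
To use the fine-tuned model later, you can load the saved adapters the same way the base model was loaded; a sketch, assuming the local path used above (a Hub repo name works the same way):

from unsloth import FastLanguageModel

# Point model_name at the local save directory or at "your_username/your_model_name"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="path/to/save",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable the faster inference path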

Conclusion 🎉

And with that, you should be well-equipped to fine-tune the Llama 3 model for a variety of tasks. By mastering these techniques, you’ll be able to tailor the model to your specific needs, enabling you to tackle AI projects with greater efficiency and precision. Best of luck with your fine-tuning endeavors and exciting AI projects ahead! 🚀