A Step-by-Step Guide to Fine-tuning LLaMa-2 on Google Colab…

Vinsmoke Somya
8 min read · Mar 2, 2024


The LLM training process

The training process for an LLM primarily involves two key steps:

  1. Pre-training: Think of this phase as laying the groundwork for the model’s understanding of language. It’s like teaching the ABCs to a student before they dive into reading complex books. During pre-training, the model gets exposed to tons of text from all over the internet. This helps it grasp the basics of grammar, vocabulary, and common language patterns. As it goes through this phase, the model learns to predict what comes next in a sentence, gaining a solid grasp of language structure.
  2. Fine-tuning: This is the crucial next step. After building a foundation in pre-training, the model undergoes a more focused training process. It’s akin to giving tailored lessons to a student to excel in a specific subject at school. For example, fine-tuning might involve sharpening the model’s skills in answering questions or generating code. Essentially, fine-tuning takes the broad language knowledge gained during pre-training and refines it for specific tasks, making the model more precise and effective.
    However, even with fine-tuning, there are still challenges. Sometimes, the model produces inaccurate or nonsensical output. It can be sensitive to how input is phrased and may be influenced by biases in the fine-tuning data. Understanding subtle nuances in complex conversations can also be tricky. Plus, generating coherent long-form content, like articles or chatbot responses, can be challenging for the model. These limitations underscore the need for ongoing research and development to improve fine-tuned models, ensuring they’re more reliable and ethically sound for various AI applications.

Reinforcement Learning from Human Feedback (RLHF) serves as a tutor for language models, akin to providing additional guidance after pre-training and fine-tuning. It resembles a teacher reviewing and grading a model’s responses, aiming to further enhance its capabilities. Human feedback, delivered through evaluations and corrections, serves as a means for the model to learn from errors and refine its language skills. Similar to how students improve through feedback in their studies, RLHF assists language models in excelling at specific tasks by incorporating guidance from humans.

Addressing the challenges faced by RLHF, a new technique named Direct Preference Optimization (DPO) is stepping into the game. DPO aims to overcome the limitations of RLHF in fine-tuning large language models (LLMs). Unlike RLHF, which relies on complex learning of reward functions, DPO simplifies the process by treating it as a classification problem using human preference data.
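
To make the classification framing concrete, here is a minimal, illustrative sketch of the DPO loss in PyTorch. It is not part of the fine-tuning walkthrough below, and the log-probability inputs are assumed to have been computed elsewhere for each (chosen, rejected) preference pair:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards": how much the trained policy prefers each response
    # relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification over preference pairs: push the chosen response's
    # reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()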

Google Colab limitations:

Fine-tuning a large language model like Llama-2 on Google Colab’s free tier comes with notable constraints. Sessions are limited to roughly 12 hours of continuous code execution and disconnect after just 15–30 minutes of inactivity. GPU usage is also capped, which in practice limits training to about 12 hours per day.

To navigate these limitations, a strategic approach is essential. This involves breaking the training process into smaller chunks, utilizing checkpoints to resume training, and preprocessing data to fit within memory and time constraints. Efficiency becomes paramount, with optimizations such as mixed precision training and code optimization being crucial.
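
As a concrete example of the checkpoint strategy, Hugging Face’s Trainer (used later in this guide through SFTTrainer) can resume from the latest checkpoint in its output directory; a minimal sketch, assuming checkpoints were saved somewhere that survives a disconnect, such as a mounted Google Drive folder:

# Resume a previous run from the most recent checkpoint in output_dir.
# This only works if checkpoints were actually written (see save_steps in
# the TrainingArguments later on).
trainer.train(resume_from_checkpoint=True)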

Monitoring progress and metrics, experimenting with hyperparameters, and managing time effectively are all part of the strategy. Batch size and learning rate adjustments, along with patience and persistence, are key elements in this endeavor.

In essence, while fine-tuning Llama-2 on Google Colab’s free tier is feasible, it requires careful planning, efficient resource utilization, and a willingness to adapt to the constraints imposed by the platform.

Fine-Tuning Llama 2 Step-by-Step

We’re opting to use 🦙 Llama-2-7B-HF, a smaller pre-trained model within the Llama-2 lineup, and fine-tune it with the QLoRA technique.

QLoRA (Quantized Low-Rank Adaptation) extends LoRA (Low-Rank Adapters) by adding quantization to improve parameter efficiency during fine-tuning. Notably, QLoRA is more memory-efficient than LoRA because it loads the pre-trained model onto GPU memory as 4-bit weights, whereas LoRA typically loads them in 8- or 16-bit precision. This optimization reduces memory requirements and accelerates computations.

In simpler terms, instead of updating all of the model’s weights, we’ll insert small adapters between the model’s components and train only those. This lets us fine-tune the LLM on a consumer GPU and speeds up training considerably.

Install required packages

!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece bitsandbytes

Import required libraries

  • transformers: This library offers APIs to facilitate the download and use of pre-trained models.
  • bitsandbytes: Designed specifically for quantization purposes, this library focuses on reducing the memory footprint of large language models, particularly on GPUs.
  • peft: Utilized for integrating LoRA adapters into Language Models (LLMs).
  • trl: This library houses an SFT (Supervised Fine-Tuning) class that aids in fine-tuning models.
  • accelerate and xformers: These libraries are employed to enhance the inference speed of the model, thereby optimizing its performance.
  • wandb: This tool serves as a monitoring platform, used to track and observe the training process.
  • datasets: Utilized in conjunction with Hugging Face, this library facilitates the loading of datasets.
  • gradio: This library is employed for the creation of straightforward user interfaces, simplifying the design process.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os,torch, wandb, platform, gradio, warnings
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login

Check system specs

def print_system_specs():
    # Check if CUDA is available
    is_cuda_available = torch.cuda.is_available()
    print("CUDA Available:", is_cuda_available)
    # Get the number of available CUDA devices
    num_cuda_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_cuda_devices)
    if is_cuda_available:
        for i in range(num_cuda_devices):
            # Get CUDA device properties
            device = torch.device('cuda', i)
            print(f"--- CUDA Device {i} ---")
            print("Name:", torch.cuda.get_device_name(i))
            print("Compute Capability:", torch.cuda.get_device_capability(i))
            print("Total Memory:", torch.cuda.get_device_properties(i).total_memory, "bytes")
    # Get CPU information
    print("--- CPU Information ---")
    print("Processor:", platform.processor())
    print("System:", platform.system(), platform.release())
    print("Python Version:", platform.python_version())

print_system_specs()

Setting the model variables

# Pre trained model
model_name = "meta-llama/Llama-2-7b-hf"

# Dataset name
dataset_name = "vicgalle/alpaca-gpt4"

# Hugging Face repository to save the fine-tuned model (create a new repository on Hugging Face, then paste its ID here)
new_model = "Repository link here"

Log into the Hugging Face Hub

notebook_login()
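
Because the training arguments below report metrics to Weights & Biases (report_to="wandb") and the post-training code calls wandb.finish(), it also makes sense to authenticate with W&B at this point; a minimal sketch (the project name here is just an example):

# Log into Weights & Biases so the trainer can stream metrics there.
wandb.login()
# Optionally start a named run; "llama-2-7b-alpaca-gpt4" is an arbitrary example.
wandb.init(project="llama-2-7b-alpaca-gpt4", job_type="training")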

Load dataset

We are utilizing the pre-processed dataset vicgalle/alpaca-gpt4 from Hugging Face.

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train[0:10000]")
dataset["text"][0]

Loading the model and tokenizer

We are going to load the Llama-2-7B-HF pre-trained model with 4-bit quantization, and the compute data type will be float16.

# Load base model (llama-2-7b-hf) and tokenizer
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token  # display the current BOS/EOS flags (notebook cell output)

LoRA config

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]
)
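
As an optional sanity check (the trainer below applies peft_config by itself, so this step is not required), you can wrap the model with the adapter config and confirm that only a small fraction of parameters will actually be trained. If you keep this step, pass peft_model to the trainer and drop peft_config so the adapters are not applied twice:

# Attach the LoRA adapters and report trainable vs. frozen parameter counts.
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()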

Training arguments

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_steps=1000,
    logging_steps=30,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.3,
    group_by_length=True,
    lr_scheduler_type="linear",
    report_to="wandb",
)

SFTTrainer arguments

# Setting SFT parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

We’re all set to begin the training process.

# Train model
trainer.train()

During this critical phase, keep a close eye on the training loss. Irregularities or anomalies in the loss curve are a signal to consider halting the run. Overfitting is a common concern here and may call for adjusting the hyperparameters and retrying to reach better results. Close monitoring makes timely intervention and the necessary adjustments possible.

‘Good training loss’ refers to a situation where the loss metric steadily decreases or reaches a low value during machine learning model training, indicating effective learning from the data. It signifies that the model is capturing relevant patterns and improving its predictive capabilities. However, it’s essential to guard against overfitting, where the model memorizes the training data without generalizing well to new data.

On the other hand, ‘Bad training loss’ indicates undesirable behavior during model training, such as high or increasing loss, fluctuating loss without a clear trend, convergence to a high value, or overfitting. These issues suggest that the model is struggling to learn from the data effectively and may require adjustments to its architecture, hyperparameters, or the quality of the training data. Addressing these problems is crucial for ensuring better model performance and generalization to unseen data.
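
If you would rather have the notebook react to a bad loss curve automatically instead of watching it, one option is a small custom callback built on the Transformers TrainerCallback API. LossSpikeCallback below is a hypothetical helper, not something shipped with the library:

from transformers import TrainerCallback

class LossSpikeCallback(TrainerCallback):
    """Stop training once the logged training loss stops improving for too long."""

    def __init__(self, patience=5):
        self.best_loss = float("inf")
        self.bad_logs = 0
        self.patience = patience

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return
        if loss < self.best_loss:
            self.best_loss = loss
            self.bad_logs = 0
        else:
            self.bad_logs += 1
            if self.bad_logs >= self.patience:
                control.should_training_stop = True  # end the run gracefully

# Register it before calling trainer.train():
# trainer.add_callback(LossSpikeCallback(patience=5))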

What comes after training?

# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

Let’s test the model

def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, E_INST = "### Instruction:\n", "### Response:\n"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Besides returning the usual output, the streamer also prints the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

stream("what is newtons 3rd law and its formula")

Upload the model to a Hugging Face repository


Step 1: After training completes, use the code below to free the memory it occupied. This helps prevent your runtime from running out of memory and can also improve the performance of other programs running at the same time.

# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

Step 2: Next, merge the adapter weights with the base model.

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0}
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Step 3: Finally, once the merge is complete, push the merged model to the Hugging Face Hub. This makes the model easy to share and accessible to others in the community.

model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)
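
As a quick check (assuming the push succeeded and the repository is accessible to you), the merged model can then be loaded back straight from the Hub like any other pretrained checkpoint:

# Reload the merged, fine-tuned model and tokenizer directly from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    new_model, torch_dtype=torch.float16, device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained(new_model)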

Conclusion:

Our evaluation suggests that while the model’s performance is promising, there’s room for improvement to reach outstanding levels. With this understanding, our team remains committed to ongoing Research and Development (R&D) efforts aimed at creating a superior model. Our goal is to offer more effective solutions for Language Models (LLMs) that meet the needs of AI enthusiasts and practitioners.

It’s important to acknowledge the challenges of fine-tuning a model on platforms like Google Colab. Time constraints and resource limitations can present significant hurdles. Nonetheless, our team is actively exploring solutions to overcome these obstacles, with the aim of making fine-tuning on such platforms more accessible and efficient for all users.

In essence, our journey in the realm of LLMs continues, driven by our aspiration to deliver top-notch models and streamline the fine-tuning process. Stay tuned for further updates!

Thank you for reading…

Contact Info…

LinkedIn · Kaggle · Hugging Face · Twitter / X · GitHub
