A Step-by-Step Guide to Deploy LLaMa 2 on Your Local Machine…

Vinsmoke Somya
4 min read · Mar 7, 2024


I. Introduction

In the age of generative AI, the landscape includes paid options such as ChatGPT and the OpenAI API alongside a growing number of freely available open-source Large Language Models (LLMs). Notably, some open-source models, including Meta’s LLaMa 2, report performance comparable to, and in some cases better than, ChatGPT (specifically the GPT-3.5 variant). This article walks through deploying LLaMa 2 in your local environment, so you can pair it with your own data while keeping that data private.

This article is a good fit for you if:

  • You want to try running LLaMa 2 on your machine.
  • You are concerned about data privacy when using third-party LLM models.

II. Implement LLMs on your machine

Deploy LLaMa 2 on your local machine and build a chatbot around it.

Running a large language model normally requires a GPU with a lot of memory and a capable CPU: a standard full-precision model, which stores 32 bits per parameter, needs roughly 280 GB of VRAM for a 70B model, or about 28 GB for a 7B model. For most companies, investing in hardware powerful enough to serve LLMs in production is too expensive. Fortunately, there is active research on shrinking LLMs while preserving most of their performance.

The idea behind running LLMs on your local machine is to lower the precision of each parameter, i.e., to quantize the full model (32 bits per parameter) down to 8-bit or 4-bit parameters. This reduces the size of the model roughly 4 or 8 times compared with the standard model.
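As a rough back-of-the-envelope check, you can estimate the weight footprint at different precisions. This is an illustrative sketch, not an exact measurement: quantized formats add some overhead for scaling factors, and the runtime also needs memory for activations and the KV cache.

def approx_weight_memory_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Rough size of the model weights alone, ignoring quantization overhead,
    activations, and the KV cache."""
    bytes_per_param = bits_per_param / 8
    return num_params_billion * 1e9 * bytes_per_param / 1e9  # result in GB

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit: ~{approx_weight_memory_gb(7, bits):.0f} GB")

# 7B model at 32-bit: ~28 GB
# 7B model at 16-bit: ~14 GB
# 7B model at  8-bit: ~7 GB
# 7B model at  4-bit: ~4 GB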

In this guide, I’ll walk through the process of running an LLM on your personal computer. First, make sure the packages the script below relies on are installed: Gradio for the web UI, plus Transformers, PyTorch, Accelerate, and bitsandbytes for loading the model in 4-bit precision.

pip install gradio
pip install transformers torch accelerate bitsandbytes
pip install "huggingface_hub[cli,torch]"
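Before going further, it is worth a quick sanity check that PyTorch can see your GPU and that the 4-bit dependencies import cleanly. This is a small optional snippet, not part of the main script:

import torch
import transformers
import bitsandbytes

print("Transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))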

After that, copy the code below (also available as a GitHub Gist) into a file named main.py:

import os
from threading import Thread
from typing import Iterator

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from transformers import BitsAndBytesConfig

# Load the model weights in 4-bit (NF4) precision so they fit on a consumer GPU.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

MAX_MAX_NEW_TOKENS = 2048
DEFAULT_MAX_NEW_TOKENS = 1024
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

DESCRIPTION = """\
# Llama-2 7B Chat
This Space demonstrates model [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat) by Meta, a Llama 2 model with 7B parameters fine-tuned for chat instructions. Feel free to play with it, or duplicate to run generations without a queue! If you want to run your own service, you can also [deploy the model on Inference Endpoints](https://huggingface.co/inference-endpoints).
🔎 For more details about the Llama 2 family of models and how to use them with `transformers`, take a look [at our blog post](https://huggingface.co/blog/llama2).
🔨 Looking for an even more powerful model? Check out the [13B version](https://huggingface.co/spaces/huggingface-projects/llama-2-13b-chat) or the large [70B model demo](https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI).
"""

LICENSE = """
<p/>
---
As a derivative work of [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat) by Meta,
this demo is governed by the original [license](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/main/LICENSE.txt) and [acceptable use policy](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/main/USE_POLICY.md).
"""

if not torch.cuda.is_available():
    DESCRIPTION += "\n<p>Running on CPU 🥶 This demo does not work on CPU.</p>"


if torch.cuda.is_available():
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.use_default_system_prompt = False


def generate(
    message: str,
    chat_history: list[tuple[str, str]],
    system_prompt: str,
    max_new_tokens: int = 1024,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
) -> Iterator[str]:
    # Rebuild the conversation in the chat-template format expected by the tokenizer.
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    for user, assistant in chat_history:
        conversation.extend([{"role": "user", "content": user}, {"role": "assistant", "content": assistant}])
    conversation.append({"role": "user", "content": message})

    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
    if input_ids.shape[1] > MAX_INPUT_TOKEN_LENGTH:
        input_ids = input_ids[:, -MAX_INPUT_TOKEN_LENGTH:]
        gr.Warning(f"Trimmed input from conversation as it was longer than {MAX_INPUT_TOKEN_LENGTH} tokens.")
    input_ids = input_ids.to(model.device)

    # Stream tokens back to the UI as they are generated, running generation in a thread.
    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        {"input_ids": input_ids},
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        top_k=top_k,
        temperature=temperature,
        num_beams=1,
        repetition_penalty=repetition_penalty,
    )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        yield "".join(outputs)


chat_interface = gr.ChatInterface(
    fn=generate,
    additional_inputs=[
        gr.Textbox(label="System prompt", lines=6),
        gr.Slider(
            label="Max new tokens",
            minimum=1,
            maximum=MAX_MAX_NEW_TOKENS,
            step=1,
            value=DEFAULT_MAX_NEW_TOKENS,
        ),
        gr.Slider(
            label="Temperature",
            minimum=0.1,
            maximum=4.0,
            step=0.1,
            value=0.6,
        ),
        gr.Slider(
            label="Top-p (nucleus sampling)",
            minimum=0.05,
            maximum=1.0,
            step=0.05,
            value=0.9,
        ),
        gr.Slider(
            label="Top-k",
            minimum=1,
            maximum=1000,
            step=1,
            value=50,
        ),
        gr.Slider(
            label="Repetition penalty",
            minimum=1.0,
            maximum=2.0,
            step=0.05,
            value=1.2,
        ),
    ],
    stop_btn=None,
    examples=[
        ["Hello there! How are you doing?"],
        ["Can you explain briefly to me what is the Python programming language?"],
        ["Explain the plot of Cinderella in a sentence."],
        ["How many hours does it take a man to eat a Helicopter?"],
        ["Write a 100-word article on 'Benefits of Open-Source in AI research'"],
    ],
)

# "style.css" comes from the original Hugging Face Space; drop the css argument if you don't have the file.
with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)
    gr.DuplicateButton(value="Duplicate Space for private use", elem_id="duplicate-button")
    chat_interface.render()
    gr.Markdown(LICENSE)

if __name__ == "__main__":
    demo.queue(max_size=20).launch(server_name="0.0.0.0")
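One practical note before launching: the official meta-llama/Llama-2-7b-chat-hf weights are gated on Hugging Face, so you need to accept Meta’s license on the model page and authenticate with your Hugging Face token using the huggingface_hub CLI installed earlier:

huggingface-cli login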

Run the command below to start the server. The first run will download the model weights from Hugging Face (several GB), so it may take a while.

python main.py

The chatbot will be available at http://localhost:7860/ (since the server binds to 0.0.0.0, it is also reachable from other machines on your network).
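If you want to call the chatbot from another script instead of the browser, Gradio also exposes an HTTP API. The sketch below uses the gradio_client package and assumes the default ChatInterface endpoint name /chat with the additional inputs in the order defined above; check the “Use via API” link at the bottom of the page for the exact signature on your version of Gradio.

from gradio_client import Client  # pip install gradio_client

client = Client("http://127.0.0.1:7860/")
reply = client.predict(
    "Explain quantization in one sentence.",  # message
    "You are a concise assistant.",           # system prompt
    256,   # max new tokens
    0.6,   # temperature
    0.9,   # top-p
    50,    # top-k
    1.2,   # repetition penalty
    api_name="/chat",
)
print(reply)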

III. Conclusion

Running LLaMa 2 locally gives you a private, customizable foundation to build on: experiment with it, adapt it to your own data, and keep learning as the open-source LLM ecosystem evolves.

Thank you for reading…

Contact Info…

LinkedIn · Kaggle · HuggingFace · Twitter / X · GitHub
