Instructions to use codellama/CodeLlama-13b-Instruct-hf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use codellama/CodeLlama-13b-Instruct-hf with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="codellama/CodeLlama-13b-Instruct-hf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-13b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-13b-Instruct-hf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use codellama/CodeLlama-13b-Instruct-hf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "codellama/CodeLlama-13b-Instruct-hf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codellama/CodeLlama-13b-Instruct-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
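
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style response shape.

```python
import json
import urllib.request


def build_chat_request(prompt, model="codellama/CodeLlama-13b-Instruct-hf",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` would return the model's reply as a string.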

Use Docker

docker model run hf.co/codellama/CodeLlama-13b-Instruct-hf

SGLang

How to use codellama/CodeLlama-13b-Instruct-hf with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "codellama/CodeLlama-13b-Instruct-hf" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codellama/CodeLlama-13b-Instruct-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "codellama/CodeLlama-13b-Instruct-hf" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codellama/CodeLlama-13b-Instruct-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use codellama/CodeLlama-13b-Instruct-hf with Docker Model Runner:
```
docker model run hf.co/codellama/CodeLlama-13b-Instruct-hf
```

Issues while deploying on AWS SageMaker with TGI

by rajaswa-postman - opened Sep 7, 2023

Discussion

rajaswa-postman

Sep 7, 2023

I've been trying to deploy codellama/CodeLlama-13b-Instruct-hf on AWS SageMaker with the TGI container for a while now, and I am facing two issues in particular:

The tokenizer class mismatch -

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'. 
The class this function is called from is 'LlamaTokenizer'.

Model loading error with TGI -

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 452, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 390, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 391, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 326, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 183, in __init__
    self.rotary_emb = PositionRotaryEmbedding.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 395, in load
    inv_freq = weights.get_tensor(f"{prefix}.inv_freq")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 62, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight model.layers.0.self_attn.rotary_emb.inv_freq does not exist

Any idea about how these can be resolved?

I have tried using the latest transformers version - 4.33.1 as well.

lvwerra

Code Llama org Sep 7, 2023

cc @philschmid

d4niel92

Sep 9, 2023

edited Sep 16, 2023

Same issue here ... Any help would be greatly appreciated

I already tried pip-installing different transformers versions, but none of them fixed the problem.

!pip install git+https://github.com/huggingface/transformers.git@main
!pip install git+https://github.com/ArthurZucker/transformers.git@main
!pip install git+https://github.com/ArthurZucker/transformers.git@add-llama-code

ArthurZ

Code Llama org Sep 18, 2023

You should only need pip install git+https://github.com/huggingface/transformers.git@main; my branch was just for development.

Narsil

Sep 18, 2023

This warning is safe to ignore.

Both tokenizers are the same (for TGI purposes), as TGI doesn't use the codellama in code capabilities; you would need to send the preprompt yourself.
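
As an illustration of sending the preprompt yourself, here is a minimal sketch that wraps a user message before it is passed to TGI as the raw input string. It assumes the Llama 2 style `[INST]` / `<<SYS>>` instruction format; the helper name `build_instruct_prompt` is hypothetical, and the exact template should be checked against the model card. The BOS token is left out on the assumption that the tokenizer adds it.

```python
def build_instruct_prompt(user_message, system_prompt=None):
    """Wrap a message in the Llama 2 style instruction format (assumed here)."""
    content = user_message.strip()
    if system_prompt:
        # An optional system prompt is embedded inside the first [INST] block.
        content = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{content}"
    return f"[INST] {content} [/INST]"


prompt = build_instruct_prompt("Write a function that reverses a string.")
```

The resulting string would go in the `inputs` field of the request sent to the endpoint.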
For the missing inv_freq: codellama's weights didn't include those (essentially it's llamav2), and old TGI versions expected inv_freq to be present.
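
A checkpoint can omit inv_freq because the tensor is not learned: it follows from the model configuration. The sketch below uses the standard rotary embedding formula and is not TGI's actual code; the function name `rotary_inv_freq` is hypothetical, and the values 128 and 1,000,000 are assumptions that should be read from the checkpoint's config.json (head size is hidden_size / num_attention_heads, base is rope_theta).

```python
def rotary_inv_freq(head_dim, base=10000.0):
    """Inverse frequencies for rotary position embeddings.

    One value per pair of dimensions: 1 / base^(i / head_dim) for even i.
    """
    return [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]


# Assumed values; take them from config.json rather than hard-coding them.
inv_freq = rotary_inv_freq(head_dim=128, base=1_000_000.0)
```

A loader that recomputes these values at startup has no need for the tensor to be stored in the weight files.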

This should all be solved with the upcoming SageMaker release of the latest TGI.

d4niel92

Sep 19, 2023

Thanks for your reply, @Narsil! Any information on when the upcoming SageMaker release of the latest TGI will be available?

Narsil

Sep 19, 2023

Soon I hope, but I can't make any promises (it's not in our hands at this point)

philschmid

Sep 19, 2023

TGI 1.0.3 is now available on SageMaker.
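
For anyone redeploying with that release, a rough sketch using the SageMaker Python SDK might look like the following. The instance type, token limits, and health-check timeout are illustrative assumptions, not values from this thread, and the helper names `build_tgi_env` and `deploy` are hypothetical. `deploy` needs the `sagemaker` package and valid AWS credentials, so it is only defined here and not called.

```python
def build_tgi_env(model_id, num_gpus, max_input_length=2048, max_total_tokens=4096):
    """Environment variables for the TGI container (all values must be strings)."""
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
    }


def deploy(instance_type="ml.g5.12xlarge", tgi_version="1.0.3"):
    """Create a SageMaker endpoint running the requested TGI container version."""
    # Imported lazily so build_tgi_env can be used without the SDK installed.
    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    role = sagemaker.get_execution_role()
    image_uri = get_huggingface_llm_image_uri("huggingface", version=tgi_version)
    model = HuggingFaceModel(
        role=role,
        image_uri=image_uri,
        env=build_tgi_env("codellama/CodeLlama-13b-Instruct-hf", num_gpus=4),
    )
    # Large models can take several minutes to load, hence the longer timeout.
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        container_startup_health_check_timeout=600,
    )
```

Pinning `tgi_version` explicitly makes it clear which container release the endpoint runs.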
