Issues while deploying on AWS SageMaker with TGI
I've been trying to deploy codellama/CodeLlama-13b-Instruct-hf on AWS SageMaker with the TGI container for a while now. I am facing two issues in particular:
- Tokenizer class mismatch:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
- Model loading error with TGI:
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
    model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 452, in __init__
    self.model = FlashLlamaModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 390, in __init__
    [
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 391, in <listcomp>
    FlashLlamaLayer(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 326, in __init__
    self.self_attn = FlashLlamaAttention(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 183, in __init__
    self.rotary_emb = PositionRotaryEmbedding.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 395, in load
    inv_freq = weights.get_tensor(f"{prefix}.inv_freq")
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 62, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight model.layers.0.self_attn.rotary_emb.inv_freq does not exist
Any idea how these can be resolved?
I have also tried the latest transformers version (4.33.1).
cc @philschmid
Same issue here. Any help would be greatly appreciated.
I already tried installing different transformers versions with pip, but none of them fixed the problem:
!pip install git+https://github.com/huggingface/transformers.git@main
!pip install git+https://github.com/ArthurZucker/transformers.git@main
!pip install git+https://github.com/ArthurZucker/transformers.git@add-llama-code
You should only need pip install git+https://github.com/huggingface/transformers.git@main; my branch was only for development.
This warning is safe to ignore.
The two tokenizers are the same for TGI purposes, since TGI doesn't use the CodeLlama-specific tokenizer capabilities. You would need to send the pre-prompt yourself.
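As a rough illustration of sending the pre-prompt yourself, the sketch below wraps a user message in the Llama 2 style `[INST]` format that, to my understanding, the CodeLlama Instruct models follow. The function name `build_instruct_prompt` and the `add_bos` switch are hypothetical helpers, not part of TGI or transformers; whether a literal `<s>` is needed depends on whether the server's tokenizer already adds the BOS token.

```python
from typing import Optional


def build_instruct_prompt(
    user_message: str,
    system_prompt: Optional[str] = None,
    add_bos: bool = False,
) -> str:
    """Wrap a single user turn in the Llama 2 style instruction format.

    Assumption: the server tokenizes the raw string and does not apply
    a chat template, so the markers have to be written out by hand.
    """
    content = user_message.strip()
    if system_prompt:
        # Optional system block, placed inside the first [INST] span.
        content = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{content}"
    prompt = f"[INST] {content} [/INST]"
    # Only prepend <s> if the serving stack does not add BOS on its own.
    return f"<s>{prompt}" if add_bos else prompt


if __name__ == "__main__":
    print(build_instruct_prompt("Write a function that reverses a string."))
```

The resulting string would be sent as the `inputs` field of the request to the endpoint.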
As for the missing inv_freq: CodeLlama's weights don't include those tensors (the model is essentially Llama v2), and old TGI versions expected inv_freq to be present in the checkpoint.
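To show why a checkpoint can omit inv_freq, here is a minimal sketch of how the rotary inverse frequencies can be recomputed from configuration values alone, using the standard rotary embedding formula. The function name is hypothetical, and the numbers in the usage lines (head dimension 128 from 5120 hidden units over 40 heads, base 1e6 from `rope_theta`) reflect my understanding of the 13B CodeLlama config rather than anything stated in this thread.

```python
from typing import List


def rotary_inv_freq(head_dim: int, base: float = 10000.0) -> List[float]:
    """Return the rotary inverse frequencies for one attention head.

    inv_freq[k] = 1 / base ** (2k / head_dim), for k = 0 .. head_dim/2 - 1.
    The values depend only on the config, so they need not be stored
    in the checkpoint.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even for rotary embeddings")
    return [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]


if __name__ == "__main__":
    # Assumed values for the 13B model: head_dim = 5120 / 40, rope_theta = 1e6.
    freqs = rotary_inv_freq(head_dim=128, base=1_000_000.0)
    print(len(freqs), freqs[0], freqs[-1])
```

A loader that computes the values this way has no need to look up `rotary_emb.inv_freq` in the weight files, which is the lookup that fails in the traceback above.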
This should all be solved with the upcoming SageMaker release of the latest TGI.
Soon I hope, but I can't make any promises (it's not in our hands at this point)
TGI 1.0.3 is now available on SageMaker.
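For anyone retrying the deployment, the following is a hedged sketch of pinning the TGI 1.0.3 container with the SageMaker Python SDK. The instance type, GPU count, and token limits are assumptions chosen for illustration, not values confirmed in this thread, and `build_tgi_env` and `deploy` are hypothetical helper names. The `sagemaker` import is kept inside `deploy` so the configuration helper can be checked without AWS credentials.

```python
import json
from typing import Dict

MODEL_ID = "codellama/CodeLlama-13b-Instruct-hf"


def build_tgi_env(
    model_id: str,
    num_gpus: int,
    max_input_length: int = 2048,
    max_total_tokens: int = 4096,
) -> Dict[str, str]:
    """Build the container environment; every value must be a string."""
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }


def deploy(role_arn: str, instance_type: str = "ml.g5.12xlarge", num_gpus: int = 4):
    """Create an endpoint on the TGI 1.0.3 image (requires AWS credentials)."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    image_uri = get_huggingface_llm_image_uri("huggingface", version="1.0.3")
    model = HuggingFaceModel(
        role=role_arn,
        image_uri=image_uri,
        env=build_tgi_env(MODEL_ID, num_gpus),
    )
    # Large models can take several minutes to load, hence the longer timeout.
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        container_startup_health_check_timeout=600,
    )


if __name__ == "__main__":
    print(build_tgi_env(MODEL_ID, num_gpus=4))
```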