Fine-tuning Llama 2 with DPO


This is a Llama-2 7B model fine-tuned with DPO. It is designed to generate human-like responses to questions in Stack Exchange domains such as programming, mathematics, and physics. For more information, see the blog post and the GitHub example.
Original datasets are described in the LLaMA Model Card. Fine-tuning datasets for this model are based on Stack Exchange Paired, which consists of questions and answers from various domains in Stack Exchange, such as programming, mathematics, physics, and more. Specifically:
Traditional Fine-tuning: https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/finetune
DPO Training: https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/rl
The model was first fine-tuned on the Stack Exchange question and answer pairs (supervised fine-tuning, SFT), and then further trained with the DPO procedure, using the SFT model as the reference model. It is trained to respond to prompts that follow this template:
Question: <Query>
Answer: <Response>
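
As a minimal sketch of how the template above might be applied, the following Python helper fills in the question and leaves the answer for the model to generate. The function names, the single-newline separator, and the trailing space after `Answer:` are assumptions made for illustration, as are the record field names `question`, `response_j` (preferred answer), and `response_k` (less preferred answer). Check them against the dataset and the training script before relying on them.

```python
# Hypothetical helpers; names and exact whitespace are assumptions, not
# taken from the model card.
PROMPT_TEMPLATE = "Question: {query}\nAnswer: "


def build_prompt(query: str) -> str:
    """Fill the template with a question; the model completes the answer."""
    return PROMPT_TEMPLATE.format(query=query.strip())


def to_preference_example(record: dict) -> dict:
    """Map one paired record to a (prompt, chosen, rejected) triple.

    Field names 'question', 'response_j', 'response_k' are assumed here.
    """
    return {
        "prompt": build_prompt(record["question"]),
        "chosen": record["response_j"],
        "rejected": record["response_k"],
    }


if __name__ == "__main__":
    # Made-up record, for illustration only.
    sample = {
        "question": "How do I reverse a list in Python?",
        "response_j": "Use my_list[::-1] or my_list.reverse().",
        "response_k": "You cannot reverse a list.",
    }
    print(to_preference_example(sample)["prompt"])
```

At inference time only `build_prompt` is needed: the resulting string is passed to the model, which generates the text that follows `Answer:`.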