RLHF Checkpoints from Clipping Free Policy Optimization for Large Language Models
- asparius/Qwen2.5-7B-Instruct-GRPO-1ep-iter16
  Text Generation • 8B • Updated • 13
- asparius/Qwen2.5-7B-Instruct-GRPO-1ep-iter8
  Text Generation • 8B • Updated • 9
- asparius/Qwen2.5-7B-Instruct-GRPO-1ep-iter4
  Text Generation • 8B • Updated • 13
- asparius/Qwen2.5-7B-Instruct-GRPO-1ep-iter2
  Text Generation • 8B • Updated • 10