✅ 295B total / 21B active / 256K context
✅ Fused fast-and-slow thinking in a single model
✅ First model trained on Hunyuan's rebuilt pretraining + RL infra (Feb → Apr)
Benchmarks:
👉 SWE-Bench Verified, Terminal-Bench 2.0, BrowseComp, WideSearch — competitive results, particularly strong on agentic tool use
👉 Top score on Tsinghua's 2026 Spring math PhD qualifying exam
👉 Strong context-learning and instruction-following on Tencent's CL-bench / CL-bench-Life
Arcade-3B — SmolReasoner NoesisLab/Arcade-3B Arcade-3B is a 3B instruction-following and reasoning model built on SmolLM3-3B. It is the public release from the ARCADE project at NoesisLab, which investigates the State–Constraint Orthogonality Hypothesis: standard Transformer hidden states conflate factual content and reasoning structure in the same subspace, and explicitly decoupling them improves generalization.
Kimi K2 tech report is full of gems as always. Here are my notes on it:
> MuonClip: pretty crazy how after 70k steps the training stabilizes and QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale, but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher). A minimal sketch of the QK-clip rescaling is below.
> Sparsity scaling laws to justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level; they could have pushed it even higher, but training becomes less efficient as sparsity increases.
> They reduce the number of attention heads to make the model more efficient at long context, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers of the DSv3 arch.
Between the higher sparsity and the halved attention heads, they get an 83% FLOPs advantage over the DeepSeek-V3 arch at 128K context.
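To make the QK-clip mechanic concrete, here is a minimal sketch of the rescaling step as I read it from the report: when a head's max attention logit exceeds a threshold τ, its query/key projections are scaled down, with the correction split evenly between W_q and W_k (the tensor shapes and in-place style here are my assumptions, not the paper's code):

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             max_logit: torch.Tensor, tau: float = 100.0) -> None:
    """Rescale per-head query/key projections in place when the observed
    max attention logit exceeds tau. Assumed shapes:
    w_q, w_k: (num_heads, head_dim, hidden); max_logit: (num_heads,)."""
    # gamma = min(1, tau / S_h): < 1 only for heads whose logits overflowed
    gamma = tau / max_logit.clamp(min=tau)
    # split the correction evenly: sqrt(gamma) on W_q and sqrt(gamma) on W_k
    scale = gamma.sqrt().view(-1, 1, 1)
    w_q.mul_(scale)
    w_k.mul_(scale)
```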
> Data: rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus into different styles; for longer documents they rephrase chunk by chunk (see the sketch after this list). I'm (half) surprised that ONE epoch over data rephrased 10 ways has better accuracy than 10 epochs over the same data rephrased once (at the same total token count, I think).
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/eval to test it; as always, it's still a bit unclear to me how to measure that properly.
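A minimal sketch of what chunked, style-diverse rephrasing could look like; the styles, chunk size, and prompt wording are illustrative guesses rather than the paper's actual recipe, and `generate` stands in for whatever LLM call you use:

```python
# Sketch: rephrase a corpus into multiple styles, chunking long documents.
from typing import Callable, List

STYLES = ["encyclopedia entry", "learning note", "Q&A dialogue", "lecture transcript"]

def chunk(text: str, max_chars: int = 4000) -> List[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def rephrase_document(doc: str, generate: Callable[[str], str]) -> List[str]:
    variants = []
    for style in STYLES:
        pieces = [
            generate(f"Rewrite the following text as a {style}, "
                     f"preserving all facts:\n\n{part}")
            for part in chunk(doc)            # long docs: rephrase per chunk
        ]
        variants.append("\n".join(pieces))    # stitch chunks back together
    return variants
```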
The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 compute, but FP8 storage for specific layers; selective recomputation of inexpensive blocks; activation offloading to CPU
🚀 For those interested in summarization of long textual reports in the medical domain 📝🩺: @Xiaolihai and I are delighted to share our experiments with distillation-tuning adaptation of Qwen-2.5 0.5B. We take reports from the MultiClinSum dataset and pass them through the 72B version to retrieve report explanations, which we then use for distillation tuning of the 0.5B model. We experiment with passages written in English, French, Portuguese, and Spanish.
🔑 We find that the distillation technique yields a 2-4% performance increment over plain fine-tuning for reports in English (in both non-official and official evaluation). For the other languages, it results in systems that perform on par with conventional (standard) tuning (see results below).
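For a concrete picture, here is a rough sketch of the teacher step, assuming the instruct variants of Qwen-2.5 and an illustrative prompt (not our exact one): the 72B model writes an explanation for each report, which becomes the tuning target for the 0.5B student.

```python
# Sketch: generate distillation targets with the 72B teacher.
from transformers import pipeline

teacher = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct")

def make_distillation_pair(report: str) -> dict:
    messages = [{"role": "user",
                 "content": f"Summarize this clinical report and explain "
                            f"the key findings:\n\n{report}"}]
    # chat-format pipelines return the full message list; take the reply
    out = teacher(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]
    return {"prompt": report, "completion": out}  # SFT pair for the 0.5B student
```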
What is PTS? PTS identifies specific "pivotal tokens" that dramatically shift the probability of a successful generation. Unlike traditional DPO, which treats all tokens equally, PTS focuses optimization on the tokens that actually matter for success.
Inspired by Microsoft's recent Phi-4 paper (which used this technique to achieve SOTA reasoning with only 14B parameters), PTS is especially effective for:
- Mathematical reasoning
- Coding tasks
- Multi-step problem solving
- Any domain where specific decision points strongly impact outcomes
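Here is a minimal sketch of the core idea, as a naive token-by-token scan rather than the paper's more sample-efficient search; `sample_completions` and `is_success` are placeholders you supply (e.g. a vLLM sampler and an answer checker), and the 0.2 threshold is illustrative:

```python
# Sketch: find tokens where the estimated success probability jumps or drops.
from typing import Callable, List, Tuple

def find_pivotal_tokens(
    prompt: str,
    solution_tokens: List[str],
    sample_completions: Callable[[str, int], List[str]],
    is_success: Callable[[str], bool],
    n_samples: int = 16,
    threshold: float = 0.2,
) -> List[Tuple[int, str, float]]:
    pivotal = []
    prefix = prompt
    # p(success) before any solution token is committed
    p_prev = sum(is_success(c) for c in sample_completions(prefix, n_samples)) / n_samples
    for i, tok in enumerate(solution_tokens):
        prefix += tok
        p_next = sum(is_success(c) for c in sample_completions(prefix, n_samples)) / n_samples
        if abs(p_next - p_prev) >= threshold:   # this token shifted the odds
            pivotal.append((i, tok, p_next - p_prev))
        p_prev = p_next
    return pivotal
```

Tokens with a large positive or negative shift then become the anchor points for chosen/rejected preference pairs.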
1. Open-source code:
- Complete implementation of the PTS algorithm
- Data generation pipelines
- Usage examples and documentation
2. Huggingface resources:
- Datasets collection: https://huggingface.co/datasets?other=pts
  * Pre-generated preference pairs for various domains
  * Ready to use in your DPO training pipelines
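As a hedged usage sketch, this is roughly what dropping one of these preference datasets into TRL's DPOTrainer might look like (the dataset id and model are placeholders; recent TRL versions expect `processing_class` where older ones took `tokenizer`):

```python
# Sketch: DPO training on prompt/chosen/rejected preference pairs with TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("your-org/pts-preference-pairs", split="train")  # placeholder id

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="pts-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```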
The algorithm is straightforward to implement and can significantly improve your model's reasoning capabilities. Check out the repository for details on getting started!
We welcome feedback, contributions, and collaborations. Let us know if you use PTS in your projects!
Want a fun project to learn how AI agents work? I built one that queries the FOIA API, and you can do it too!

It's a quick proof of concept I did for a workshop at the Hacks/Hackers Summit in Baltimore, demonstrating what agents can do, how to design workflows, and how to approach the coding side.
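To give a flavor of the workflow, here is a stripped-down sketch of the agent loop; the FOIA endpoint, the `llm` call, and the tool-call shapes are placeholders for illustration, not the real API spec from the workshop:

```python
# Sketch: a tool-calling agent loop around a FOIA search tool.
import requests

def search_foia(query: str) -> str:
    # Placeholder request; consult FOIA.gov's API docs for real endpoints/keys.
    resp = requests.get("https://api.foia.gov/api/EXAMPLE", params={"q": query})
    return resp.text

TOOLS = {"search_foia": search_foia}

def run_agent(llm, user_msg: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = llm(messages)            # your chat-completion call
        call = reply.get("tool_call")    # e.g. {"name": ..., "args": ...}
        if call is None:
            return reply["content"]      # model answered directly
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})
    return "Agent stopped after max_turns."
```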
Super happy to welcome Nvidia as our latest enterprise hub customer. They have almost 2,000 team members using Hugging Face, and close to 20,000 followers of their org. Can't wait to see what they'll open-source for all of us in the coming months!
How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, Distilabel, and 🌤️ Lighteval to generate an evaluation dataset and evaluate models.
- We found that VLMs can self-improve their reasoning performance through a reflection mechanism, and importantly, this approach scales with test-time compute.
- Evaluations on comprehensive and diverse vision-language reasoning tasks are included!
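As a rough illustration of the reflection mechanism (not the authors' exact prompts or stopping rule), a generic reflect-and-revise loop might look like this, with `vlm` standing in for your vision-language model call:

```python
# Sketch: test-time reflection — draft, critique against the image, revise.
def answer_with_reflection(vlm, image, question: str, rounds: int = 2) -> str:
    answer = vlm(image, f"Question: {question}\nAnswer with your reasoning.")
    for _ in range(rounds):  # more rounds = more test-time compute
        critique = vlm(image, f"Question: {question}\nDraft answer: {answer}\n"
                              "Check the reasoning against the image and point out errors.")
        answer = vlm(image, f"Question: {question}\nDraft: {answer}\n"
                            f"Critique: {critique}\nGive a corrected final answer.")
    return answer
```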
The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into lists of structs
- Removing empty system prompts
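A sketch of these steps with 🤗 datasets; the dataset id, column names, and the assumption that messages are stored as JSON strings are illustrative:

```python
# Sketch: join splits, parse message strings, drop empty system prompts.
import json
from datasets import load_dataset, concatenate_datasets

ds = load_dataset("your-org/raw-dataset")  # placeholder id

# 1) join splits and record the origin in a new "split" column
joined = concatenate_datasets(
    [ds[name].add_column("split", [name] * len(ds[name])) for name in ds]
)

# 2) parse stringified messages into lists of {role, content} structs
joined = joined.map(lambda ex: {"messages": json.loads(ex["messages"])})

# 3) drop empty system prompts from each conversation
joined = joined.map(lambda ex: {"messages": [
    m for m in ex["messages"]
    if not (m["role"] == "system" and not m["content"].strip())
]})
```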