✅ 295B total / 21B active / 256K context
✅ Fused fast-and-slow thinking in a single model
✅ First model trained on Hunyuan's rebuilt pretraining + RL infra (Feb → Apr)
Benchmarks:
👉 SWE-Bench Verified, Terminal-Bench 2.0, BrowseComp, WideSearch — competitive results, particularly strong on agentic tool use
👉 Top score on Tsinghua's 2026 Spring math PhD qualifying exam
👉 Strong context-learning and instruction-following on Tencent's CL-bench / CL-bench-Life
Arcade-3B — SmolReasoner NoesisLab/Arcade-3B Arcade-3B is a 3B instruction-following and reasoning model built on SmolLM3-3B. It is the public release from the ARCADE project at NoesisLab, which investigates the State–Constraint Orthogonality Hypothesis: standard Transformer hidden states conflate factual content and reasoning structure in the same subspace, and explicitly decoupling them improves generalization.
Kimi K2 tech report is full of gems as always. Here are my notes on it:
> MuonClip: pretty crazy how after 70k steps the training stabilizes and QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale, but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon pushes the singular values of the update matrix higher). A minimal sketch of the QK-clip rescaling is below.
> Sparsity scaling laws to justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level; they could have pushed it even higher, but training becomes less efficient as sparsity increases.
> They reduce the number of attention heads to make the model more efficient at long context, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers of the DSv3 arch.
Between the higher sparsity and the halved attention heads, they get an 83% FLOPs advantage over the DeepSeek-V3 arch at 128K context.
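To make the QK-clip mechanic concrete, here is a minimal sketch of the rescaling step as I read it from the report: when a head's max attention logit exceeds a threshold τ, its query/key projections are scaled down, with the correction split evenly between W_q and W_k (the tensor shapes and in-place style here are my assumptions, not the paper's code):

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             max_logit: torch.Tensor, tau: float = 100.0) -> None:
    """Rescale per-head query/key projections in place when the observed
    max attention logit exceeds tau. Assumed shapes:
    w_q, w_k: (num_heads, head_dim, hidden); max_logit: (num_heads,)."""
    # gamma = min(1, tau / S_h): < 1 only for heads whose logits overflowed
    gamma = tau / max_logit.clamp(min=tau)
    # split the correction evenly: sqrt(gamma) on W_q and sqrt(gamma) on W_k
    scale = gamma.sqrt().view(-1, 1, 1)
    w_q.mul_(scale)
    w_k.mul_(scale)
```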
> Data: rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus into different styles; for longer documents they rephrase chunk by chunk (see the sketch after this list). I'm (half) surprised that ONE epoch over data rephrased 10 ways has better accuracy than 10 epochs over the same data rephrased once (at the same total token count, I think).
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/eval to test it; as always, it's still a bit unclear to me how to measure that properly.
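A minimal sketch of what chunked, style-diverse rephrasing could look like; the styles, chunk size, and prompt wording are illustrative guesses rather than the paper's actual recipe, and `generate` stands in for whatever LLM call you use:

```python
# Sketch: rephrase a corpus into multiple styles, chunking long documents.
from typing import Callable, List

STYLES = ["encyclopedia entry", "learning note", "Q&A dialogue", "lecture transcript"]

def chunk(text: str, max_chars: int = 4000) -> List[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def rephrase_document(doc: str, generate: Callable[[str], str]) -> List[str]:
    variants = []
    for style in STYLES:
        pieces = [
            generate(f"Rewrite the following text as a {style}, "
                     f"preserving all facts:\n\n{part}")
            for part in chunk(doc)            # long docs: rephrase per chunk
        ]
        variants.append("\n".join(pieces))    # stitch chunks back together
    return variants
```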
The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 compute, but FP8 storage for specific layers; selective recomputation of inexpensive blocks; activation offloading to CPU
🚀 For those interested in summarization of long textual reports in the medical domain 📝🩺: @Xiaolihai and I are delighted to share our experiments with distillation-tuning adaptation of Qwen-2.5 0.5B. We take reports from the MultiClinSum dataset and pass them through the 72B version to retrieve report explanations, which we then use for distillation tuning of the 0.5B model. We experiment with passages written in English, French, Portuguese, and Spanish.
🔑 We find that the distillation technique yields a 2-4% performance increment over plain fine-tuning for reports in English (in both non-official and official evaluation). For the other languages, it results in systems that perform on par with conventional (standard) tuning (see results below).
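For a concrete picture, here is a rough sketch of the teacher step, assuming the instruct variants of Qwen-2.5 and an illustrative prompt (not our exact one): the 72B model writes an explanation for each report, which becomes the tuning target for the 0.5B student.

```python
# Sketch: generate distillation targets with the 72B teacher.
from transformers import pipeline

teacher = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct")

def make_distillation_pair(report: str) -> dict:
    messages = [{"role": "user",
                 "content": f"Summarize this clinical report and explain "
                            f"the key findings:\n\n{report}"}]
    # chat-format pipelines return the full message list; take the reply
    out = teacher(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]
    return {"prompt": report, "completion": out}  # SFT pair for the 0.5B student
```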
What is PTS? PTS identifies specific "pivotal tokens" that dramatically shift the probability of a successful generation. Unlike traditional DPO, which treats all tokens equally, PTS focuses optimization on the tokens that actually matter for success.
Inspired by Microsoft's recent Phi-4 paper (which used this technique to achieve SOTA reasoning with only 14B parameters), PTS is especially effective for:
- Mathematical reasoning
- Coding tasks
- Multi-step problem solving
- Any domain where specific decision points strongly impact outcomes
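Here is a minimal sketch of the core idea, as a naive token-by-token scan rather than the paper's more sample-efficient search; `sample_completions` and `is_success` are placeholders you supply (e.g. a vLLM sampler and an answer checker), and the 0.2 threshold is illustrative:

```python
# Sketch: find tokens where the estimated success probability jumps or drops.
from typing import Callable, List, Tuple

def find_pivotal_tokens(
    prompt: str,
    solution_tokens: List[str],
    sample_completions: Callable[[str, int], List[str]],
    is_success: Callable[[str], bool],
    n_samples: int = 16,
    threshold: float = 0.2,
) -> List[Tuple[int, str, float]]:
    pivotal = []
    prefix = prompt
    # p(success) before any solution token is committed
    p_prev = sum(is_success(c) for c in sample_completions(prefix, n_samples)) / n_samples
    for i, tok in enumerate(solution_tokens):
        prefix += tok
        p_next = sum(is_success(c) for c in sample_completions(prefix, n_samples)) / n_samples
        if abs(p_next - p_prev) >= threshold:   # this token shifted the odds
            pivotal.append((i, tok, p_next - p_prev))
        p_prev = p_next
    return pivotal
```

Tokens with a large positive or negative shift then become the anchor points for chosen/rejected preference pairs.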
1. Open-source code:
- Complete implementation of the PTS algorithm
- Data generation pipelines
- Usage examples and documentation
2. Huggingface resources:
- Datasets collection: https://huggingface.co/datasets?other=pts
  * Pre-generated preference pairs for various domains
  * Ready to use in your DPO training pipelines
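As a hedged usage sketch, this is roughly what dropping one of these preference datasets into TRL's DPOTrainer might look like (the dataset id and model are placeholders; recent TRL versions expect `processing_class` where older ones took `tokenizer`):

```python
# Sketch: DPO training on prompt/chosen/rejected preference pairs with TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("your-org/pts-preference-pairs", split="train")  # placeholder id

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="pts-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```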
The algorithm is straightforward to implement and can significantly improve your model's reasoning capabilities. Check out the repository for details on getting started!
We welcome feedback, contributions, and collaborations. Let us know if you use PTS in your projects!
Want a fun project to learn how AI agents work? I built one that queries the FOIA API, and you can do it too!

It's a quick proof of concept I did for a workshop at the Hacks/Hackers Summit in Baltimore, demonstrating what agents can do, how to design workflows, and how to approach the coding side.
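To give a flavor of the workflow, here is a stripped-down sketch of the agent loop; the FOIA endpoint, the `llm` call, and the tool-call shapes are placeholders for illustration, not the real API spec from the workshop:

```python
# Sketch: a tool-calling agent loop around a FOIA search tool.
import requests

def search_foia(query: str) -> str:
    # Placeholder request; consult FOIA.gov's API docs for real endpoints/keys.
    resp = requests.get("https://api.foia.gov/api/EXAMPLE", params={"q": query})
    return resp.text

TOOLS = {"search_foia": search_foia}

def run_agent(llm, user_msg: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = llm(messages)            # your chat-completion call
        call = reply.get("tool_call")    # e.g. {"name": ..., "args": ...}
        if call is None:
            return reply["content"]      # model answered directly
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})
    return "Agent stopped after max_turns."
```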
Super happy to welcome Nvidia as our latest enterprise hub customer. They have almost 2,000 team members using Hugging Face, and close to 20,000 followers of their org. Can't wait to see what they'll open-source for all of us in the coming months!
How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, Distilabel, and 🌤️ Lighteval to generate an evaluation dataset and evaluate models.
- We found that VLMs can self-improve their reasoning performance through a reflection mechanism, and importantly, this approach scales with test-time compute.
- Evaluations on comprehensive and diverse vision-language reasoning tasks are included!
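As a rough illustration of the reflection mechanism (not the authors' exact prompts or stopping rule), a generic reflect-and-revise loop might look like this, with `vlm` standing in for your vision-language model call:

```python
# Sketch: test-time reflection — draft, critique against the image, revise.
def answer_with_reflection(vlm, image, question: str, rounds: int = 2) -> str:
    answer = vlm(image, f"Question: {question}\nAnswer with your reasoning.")
    for _ in range(rounds):  # more rounds = more test-time compute
        critique = vlm(image, f"Question: {question}\nDraft answer: {answer}\n"
                              "Check the reasoning against the image and point out errors.")
        answer = vlm(image, f"Question: {question}\nDraft: {answer}\n"
                            f"Critique: {critique}\nGive a corrected final answer.")
    return answer
```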
The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into lists of structs
- Removing empty system prompts
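A sketch of these steps with 🤗 datasets; the dataset id, column names, and the assumption that messages are stored as JSON strings are illustrative:

```python
# Sketch: join splits, parse message strings, drop empty system prompts.
import json
from datasets import load_dataset, concatenate_datasets

ds = load_dataset("your-org/raw-dataset")  # placeholder id

# 1) join splits and record the origin in a new "split" column
joined = concatenate_datasets(
    [ds[name].add_column("split", [name] * len(ds[name])) for name in ds]
)

# 2) parse stringified messages into lists of {role, content} structs
joined = joined.map(lambda ex: {"messages": json.loads(ex["messages"])})

# 3) drop empty system prompts from each conversation
joined = joined.map(lambda ex: {"messages": [
    m for m in ex["messages"]
    if not (m["role"] == "system" and not m["content"].strip())
]})
```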