We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to train the next generation of foundation models (LLMs) from scratch. Developed at SKT AI LABS, this corpus is not just a collection of data; it's a mission to decentralize high-grade AI training for regional languages and global knowledge.
💎 Key Highlights:
• Massive Scale: Targeting a multi-terabyte corpus on the order of 146 trillion tokens.
• Pure Quality: Curated from 500+ elite sources.
• Structured for MoE: Sharded into standardized 3.5GB units (SKT-𝕻 series) for seamless distributed training.
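Fixed-size sharding like this is typically done by rotating output files once a byte threshold is reached. A minimal sketch, assuming JSONL-formatted records (the function name, record schema, and shard naming are illustrative, not the project's actual pipeline):

```python
import json
import os


def write_shards(records, out_dir, max_bytes=3_500_000_000):
    """Write an iterable of dict records to JSONL shards,
    rotating to a new file once a shard reaches max_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, written = 0, 0
    f = open(os.path.join(out_dir, f"shard-{shard_idx:05d}.jsonl"), "w")
    for rec in records:
        line = json.dumps(rec, ensure_ascii=False) + "\n"
        # Rotate before exceeding the cap (never split a record across shards).
        if written and written + len(line.encode()) > max_bytes:
            f.close()
            shard_idx, written = shard_idx + 1, 0
            f = open(os.path.join(out_dir, f"shard-{shard_idx:05d}.jsonl"), "w")
        f.write(line)
        written += len(line.encode())
    f.close()
    return shard_idx + 1  # number of shards produced
```

Uniform shard sizes make it easy for a distributed trainer to assign whole files to workers without rebalancing.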
🤝 Open for Collaboration!
We are looking for AI researchers, CUDA engineers, and data scientists to join us in building Project Surya and the ST-X Series models. Whether your focus is optimization, custom tokenization, or architecture design, let's build the future together.
Hundreds of AI leaderboards exist on HuggingFace. Knowing which ones the community actually trusts has never been easy — until now.
Leaderboard of Leaderboards (LoL) ranks the leaderboards themselves, using live HuggingFace trending scores and cumulative likes as the signal. No editorial curation. No manual selection. Just what the global AI research community is actually visiting and endorsing, surfaced in real time.
Sort by trending to see what is capturing attention right now, or by likes to see what has built lasting credibility over time. Nine domain filters let you zero in on what matters most to your work, and every entry shows both its rank within this collection and its real-time global rank across all HuggingFace Spaces.
The collection spans well-established standards like Open LLM Leaderboard, Chatbot Arena, MTEB, and BigCodeBench alongside frameworks worth watching. FINAL Bench targets AGI-level evaluation across 100 tasks in 15 domains and recently reached the global top 5 in HuggingFace dataset rankings. Smol AI WorldCup runs tournament-format competitions for sub-8B models scored via FINAL Bench criteria. ALL Bench aggregates results across frameworks into a unified ranking that resists the overfitting risks of any single standard.
The deeper purpose is not convenience. It is transparency. How we measure AI matters as much as the AI we measure.
Today we’re publicly releasing Kanon 2 Enricher, and with it, an entirely new class of AI model that we’re calling a hierarchical graphitization model. This is fundamentally different from both universal extraction models and generative models.
As a hierarchical graphitization model, Kanon 2 Enricher natively outputs a 𝗸𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗴𝗿𝗮𝗽𝗵 rather than tokens, which makes it architecturally incapable of hallucinating or inventing text that wasn’t present in the input.
What that enables in practice is unlike any other model or ML architecture on the market:
• 𝗡𝗼 𝗵𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻𝘀 🤖 All references and links are stored as spans, meaning exact character offsets anchored to the original text, so every output string can be traced back to the input rather than generated.
• 𝗛𝗶𝗲𝗿𝗮𝗿𝗰𝗵𝗶𝗰𝗮𝗹 𝘀𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻, 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝗲𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻 📑 It deconstructs a document’s full nested hierarchy, down to chapters, sections, clauses, schedules, signatures, and even singular sentences, and classifies each span with dozens of contextual features.
• 𝗘𝗻𝘁𝗶𝘁𝘆 𝗲𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻, 𝗱𝗶𝘀𝗮𝗺𝗯𝗶𝗴𝘂𝗮𝘁𝗶𝗼𝗻, 𝗮𝗻𝗱 𝗹𝗶𝗻𝗸𝗶𝗻𝗴 🔗 It resolves what references actually point to, then links entities, citations, and cross-references into a single coherent graph.
• 𝗚𝗿𝗮𝗽𝗵-𝗳𝗶𝗿𝘀𝘁 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 🏃➡️ Small enough to run locally on a consumer PC with sub-second latency, and it stays reliable on long documents where frontier models falter.
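The span mechanism described above is what makes the no-hallucination claim structural: the model emits offsets, and the surface text is always recovered by slicing the source. A minimal sketch of the idea (the `Span` class, labels, and edge list are a hypothetical schema, not Kanon 2 Enricher's actual output format):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A node in the output graph: a label plus exact character
    offsets into the source document (no generated tokens)."""
    label: str
    start: int
    end: int

    def text(self, source: str) -> str:
        # The span's surface form is recovered by slicing the
        # original document, so it cannot differ from the input.
        return source[self.start:self.end]


doc = "Clause 4.2 applies. See Clause 4.2 for termination rights."
clause = Span("clause_reference", 0, 10)
citation = Span("cross_reference", 24, 34)
# Linking resolves the citation to its target, forming a graph edge.
edges = [(citation, clause)]

assert citation.text(doc) == clause.text(doc) == "Clause 4.2"
```

Because nodes carry offsets rather than strings, disambiguation and cross-reference linking reduce to adding edges between spans, and the whole graph stays anchored to the source text.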
Dear Hugging Face team, can we please have a way to archive HF repositories / Spaces? I have a bunch of Spaces that used to work but don't anymore because the HF Spaces implementation changed, and I think it would be good if I could archive those, like on GitHub.
React to this post if you want to see this feature! 💡