Kimi K2 tech report is full of gems as always. Here are my notes on it:
> MuonClip: Pretty crazy how training stabilizes after 70k steps and QK-clip becomes basically inactive. There is also no loss in performance with QK-clip, which is not trivial at all (tested at small scale, but with an aggressive threshold). Appendix E also has a cool explanation of why Muon makes the logits explode (tl;dr: Muon makes the singular values of the update matrix higher).
> Sparsity scaling laws justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level. They could have increased it even more, but training becomes less efficient as sparsity increases.
> They reduce the number of attention heads to make the model more efficient for long context, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers in the DSv3 arch.
At this sparsity, dividing the attention heads by 2 saves a lot: going back to the DeepSeek-V3 head count would cost about 83% more inference FLOPs at 128k context.
> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus in different styles; longer documents are rephrased chunk by chunk. I'm (half) surprised by the fact that ONLY 1 epoch (assuming same number of training tokens I think?) of data rephrased 10 times has better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge. For Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/eval to test that. As always, it's still a bit unclear to me how to properly measure that.
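To make the QK-clip note above more concrete, here is a minimal NumPy sketch of the idea as I understand it: when a head's largest attention logit exceeds a threshold, the query and key weights are each scaled by the square root of the ratio, so the logits shrink back to the threshold. The function names, the single plain attention head, and the default `tau=100.0` are assumptions for illustration, not details taken from these notes, and the real model's attention layout is more involved.

```python
import numpy as np


def max_logit(w_q, w_k, x):
    """Largest pre-softmax attention logit of one head on a batch of tokens.

    w_q, w_k: (d_model, d_head) projection weights; x: (n_tokens, d_model).
    """
    q, k = x @ w_q, x @ w_k
    return float(np.max(q @ k.T) / np.sqrt(w_q.shape[1]))


def qk_clip(w_q, w_k, x, tau=100.0):
    """Return (w_q, w_k), rescaled only if the max logit exceeds tau.

    tau is an illustrative threshold (assumption). Scaling both matrices by
    sqrt(gamma) scales every logit by gamma, so the new max equals tau.
    """
    s_max = max_logit(w_q, w_k, x)
    if s_max <= tau:
        return w_q, w_k  # clip inactive: weights are left untouched
    scale = np.sqrt(tau / s_max)
    return w_q * scale, w_k * scale


# Toy example: identity inputs and constant weights give a max logit of 200.
x = np.eye(4)
w = 10.0 * np.ones((4, 4))
w_q_new, w_k_new = qk_clip(w, w, x, tau=100.0)
print(max_logit(w, w, x), max_logit(w_q_new, w_k_new, x))
```

Once the logits stay under the threshold on their own, the early-return branch is the only one taken, which matches the observation that the clip ends up basically inactive later in training.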
The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 computation, FP8 is only used for storage of specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU
Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.
A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.
Today, I'm thrilled to introduce our first step in this direction.
In a nutshell:
- Effortlessly run prompts and models over your data.
- Agentic search for accuracy and real-time information.
- Familiar, minimalistic interface for interacting with data.
- Human feedback 2.0: Your input directly improves generated data.
- Access hundreds of open models and leading inference providers.
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
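As an illustration of the "parameter count via safetensors metadata" capability, here is a rough sketch of how such a count can be derived from a file header alone. It relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype and shape). The function names are hypothetical and this is not necessarily how the tool does it. Fetching only the header bytes from a remote file is left out.

```python
import json
import struct
from math import prod


def read_header(raw: bytes) -> dict:
    """Parse the JSON header at the start of a safetensors byte string."""
    (header_len,) = struct.unpack("<Q", raw[:8])  # 8-byte little-endian length
    return json.loads(raw[8:8 + header_len])


def count_parameters(header: dict) -> int:
    """Sum the element counts of all tensors, skipping the metadata entry."""
    return sum(
        prod(info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    )


# Tiny in-memory file with made-up tensors, only to exercise the functions.
fake_header = json.dumps({
    "__metadata__": {"format": "pt"},
    "layer.weight": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]},
    "layer.bias": {"dtype": "F32", "shape": [3], "data_offsets": [24, 36]},
}).encode("utf-8")
fake_file = struct.pack("<Q", len(fake_header)) + fake_header + bytes(36)

print(count_parameters(read_header(fake_file)))  # 2*3 + 3 = 9
```

Because only the header is needed, the count never requires loading the tensor data itself.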
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
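For intuition, semantic search of this kind is commonly built by ranking items by the cosine similarity between embedding vectors rather than by shared keywords. The sketch below assumes the embeddings already exist; the tiny vectors and item names are invented for illustration, and nothing here reflects the actual API behind the tool.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, items, top_k=3):
    """Return the names of the top_k items closest to the query embedding.

    items: list of (name, embedding) pairs; embeddings come from some
    embedding model that is outside the scope of this sketch (assumption).
    """
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 2-d embeddings, just to show the ranking step.
items = [
    ("dataset-a", [1.0, 0.0]),
    ("dataset-b", [0.0, 1.0]),
    ("dataset-c", [1.0, 1.0]),
]
print(semantic_search([1.0, 0.1], items, top_k=2))
```

Two items can rank close together here without sharing a single word in their names or descriptions, which is the difference from keyword matching.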
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)