AI & ML interests

Contributors who are invited to beta-test our next big feature! Contact us if you want to join this team :-)

Recent Activity

Norod78 
posted an update 3 months ago
Multilingual Tokenization Showdown
Analyzing 12 LLM Tokenizers Across 204 Languages.

First, I've created a dataset with Wikipedia's "Cat" article text in 272 languages:
Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text with 12 tokenizers and calculated two ratios: "characters per token" and "words per token". The higher these ratios are, the more information each token represents on average for that language (which might allow an LLM to learn more per parameter when trained on a dataset in that language).
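As a rough illustration of the two ratios, here is a minimal Python sketch. The function name `tokenization_ratios`, the whitespace-based word count, and the toy tokenizer are assumptions made for illustration and are not taken from the linked repository.

```python
from typing import Callable, Dict, List


def tokenization_ratios(text: str, tokenize: Callable[[str], List]) -> Dict[str, float]:
    """Return characters-per-token and words-per-token for one text.

    Assumption: words are counted by splitting on whitespace, which is only
    a rough proxy for languages that do not separate words with spaces.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    words = text.split()
    return {
        "chars_per_token": len(text) / len(tokens),
        "words_per_token": len(words) / len(tokens),
    }


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into 3-character pieces."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


ratios = tokenization_ratios("the cat sat on the mat", toy_tokenize)
print(ratios)
```

To try this with a real tokenizer, one option is to load it with `AutoTokenizer.from_pretrained(...)` from the `transformers` library and pass `lambda t: tok.encode(t, add_special_tokens=False)` as the `tokenize` argument. Whether special tokens should be excluded is a choice, and the original evaluation may handle it differently.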

You can see a slideshow summary of the results here:
https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly. I've made the code available on GitHub, so you can re-create the raw results JSONL with this repo:
https://github.com/Norod/wikicat-tokenizer-eval

Post on X:
https://x.com/Norod78/status/1984366900550266999

davanstrien 
posted an update 5 months ago
davanstrien 
posted an update 8 months ago
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
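For context on the parameter-count item: a safetensors file begins with an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype and shape, so parameters can be counted without loading any weights. The sketch below is an assumption about how such a count could work, not the tool's actual code. The function `count_parameters` and the in-memory example file are hypothetical.

```python
import json
import math
import struct


def count_parameters(blob: bytes) -> int:
    """Count parameters from the JSON header of a safetensors byte string."""
    # First 8 bytes: unsigned little-endian 64-bit size of the JSON header.
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        total += math.prod(info["shape"])  # a scalar has shape [] and counts as 1
    return total


# Build a tiny example in memory (hypothetical tensor names, zeroed weights).
example_header = {
    "__metadata__": {"format": "pt"},
    "embed.weight": {"dtype": "F32", "shape": [10, 4], "data_offsets": [0, 160]},
    "head.bias": {"dtype": "F32", "shape": [4], "data_offsets": [160, 176]},
}
header_bytes = json.dumps(example_header).encode("utf-8")
example_blob = struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(176)

print(count_parameters(example_blob))  # 44
```

Because only the header is read, the same idea works on just the first few kilobytes of a large file.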

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp
davanstrien 
posted an update 9 months ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)