The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning 🧙‍♂️🦸‍♀️🦹🧙‍♀️
By popular request, I've just added two subsets to the Common Crawl Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:
- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (BramVanroy/CommonCrawl-CreativeCommons-recommended): applies additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. The latter because you should probably get Wikipedia content from a more reliable source that provides better-parsed text.
It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
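As a quick start, here is a minimal sketch for streaming the recommended subset with 🤗 datasets. The "eng" config name is an assumption based on typical per-language configs; check the dataset page for the actual config names.

```python
from datasets import load_dataset

# Stream the strictly filtered C5r subset without downloading everything.
# "eng" is an assumed per-language config name; see the dataset card
# for the configs that actually exist.
ds = load_dataset(
    "BramVanroy/CommonCrawl-CreativeCommons-recommended",
    "eng",
    split="train",
    streaming=True,
)
print(next(iter(ds)))
```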
🚀 I just published Sentence Transformers v5.1.0, and it's a big one: 2x-3x speedups for SparseEncoder models via the ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more:
1️⃣ Faster ONNX and OpenVINO backends for SparseEncoder models. Usage is as simple as passing backend="onnx" or backend="openvino" when initializing a SparseEncoder, but I also included utility functions for optimization, dynamic quantization, and static quantization, plus benchmarks.
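A minimal sketch of the backend switch; the model name here is just an example of a SPLADE-style sparse model:

```python
from sentence_transformers import SparseEncoder

# Load a sparse model with the ONNX backend instead of the default torch one
# (any SparseEncoder-compatible model works; this one is just an example)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")

embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
])
print(embeddings.shape)
```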
2️⃣ New n-tuple-scores output format from mine_hard_negatives. This new output format is immediately compatible with MarginMSELoss and SparseMarginMSELoss for training SentenceTransformer, CrossEncoder, and SparseEncoder models.
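A minimal sketch, assuming a standard (query, answer) pair dataset; the dataset and hyperparameters are placeholders, not a recommendation:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Any (query, positive) pair dataset works; this one is just an example
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
model = SentenceTransformer("all-MiniLM-L6-v2")

# "n-tuple-scores" keeps the query, positive, mined negatives, and their
# similarity scores, ready for (Sparse)MarginMSELoss-style distillation
dataset = mine_hard_negatives(
    dataset,
    model,
    num_negatives=5,
    output_format="n-tuple-scores",
)
```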
3️⃣ Gathering across devices. When doing multi-GPU training with a loss that uses in-batch negatives (e.g. MultipleNegativesRankingLoss), you can now use gather_across_devices=True to load in-batch negatives from the other devices too! Essentially a free lunch, with pretty big impact potential in my evals.
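A minimal sketch of where the new flag goes, assuming it is passed to the loss constructor:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")

# With multiple GPUs, each device now also treats the other devices'
# in-batch samples as extra negatives, effectively enlarging the batch
loss = MultipleNegativesRankingLoss(model, gather_across_devices=True)
```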
4️⃣ Trackio support. If you also upgrade transformers and install trackio with pip install trackio, your experiments will automatically be tracked locally with trackio. Just open up localhost and have a look at your losses/evals: no logins, no metric uploading.
5️⃣ MTEB Documentation. We've added some documentation on properly evaluating SentenceTransformer models with MTEB. It's rudimentary, as the documentation on the MTEB side is already great, but it should get you started.
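A minimal sketch of such an evaluation; the task name is just an example, pick tasks relevant to your use case:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate on a single classification task as a smoke test
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results")
```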
Plus many more smaller features & fixes (crash fixes, compatibility with datasets v4, FIPS compatibility, etc.).
Big thanks to all of the contributors for helping with the release; many of the features in this release were proposed by others. I have a big list of potential future features that I'd love to add.
Fine-tune Gemma3n on videos with audio in them, on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
Keep in mind, it's made for educational purposes 🫡 We do LoRA, audio resampling & video downsampling to be able to train in <40GB VRAM. Stretch modalities and unfreeze layers as you wish! 👏🏻 merve/smol-vision
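As an illustration, a minimal LoRA sketch with peft; the model id, target modules, and hyperparameters are assumptions, not the notebook's exact settings:

```python
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

# The Gemma3n checkpoint id is an assumption; check the notebook for the exact one
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3n-E2B-it")

# Adapt only the attention projections; rank/alpha are illustrative values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```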
They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM and a diffusion model). The model is actually a full LLM (Qwen2); the tokenizer converts images into tokens 🤯
YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).
Here, I built a simple editor, first for @dstackai, and I will share the live endpoint this week. Let me know what you think about this approach.
If people find this useful, I am going to do the same thing for the LLM training recipes of popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear your thoughts.
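For context, a minimal sketch of the kind of dstack task config such an editor would target; the values are illustrative assumptions, not output from the actual editor:

```yaml
# A basic dstack task: provision a GPU and run a training script
type: task
name: train-example

python: "3.11"

commands:
  - pip install -r requirements.txt
  - python train.py

resources:
  gpu: 24GB
```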