Ilyas Moutawwakil
IlyasMoutawwakil
AI & ML interests
Optimization, LLMs, Hardware, Backends, ..
Recent Activity
published an article about 18 hours ago
Mixture of Experts (MoEs) in Transformers
upvoted an article 3 days ago
GGML and llama.cpp join HF to ensure the long-term progress of Local AI
upvoted an article 9 days ago
Custom Kernels for All from Codex and Claude
replied to their post about 1 month ago
posted an update about 1 month ago
Post
3037
Transformers v5 just landed!
It significantly unifies and reduces modeling code across architectures, while opening the door to a whole new class of performance optimizations.
My favorite new feature?
The new dynamic weight loader + converter. Here's why:
Over the last few months, the core Transformers maintainers built an incredibly fast weight loader, capable of converting tensors on the fly while loading them in parallel threads. This means we're no longer constrained by how parameters are laid out inside the safetensors weight files.
In practice, this unlocks two big things:
- Much more modular modeling code. You can now clearly see how architectures build on top of each other (DeepSeek v2 → v3, Qwen v2 → v3 → MoE, etc.). This makes shared bottlenecks obvious and lets us optimize the right building blocks once, for all model families.
- Performance optimizations beyond what torch.compile can do alone. torch.compile operates on the computation graph, but it can't change parameter layouts. With the new loader, we can restructure weights at load time: fusing MoE expert projections, merging attention QKV projections, and enabling more compute-dense kernels that simply weren't possible before.
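To make the "restructure weights at load time" idea concrete, here is a dependency-free toy sketch (my own illustration, not the actual Transformers loader): three separately stored projection matrices are fused into one at load time, so a single matmul replaces three.

```python
# Toy sketch of load-time weight fusion (illustrative only, not the
# actual Transformers converter): separate Q/K/V projection matrices
# are concatenated once at load time, so one matmul replaces three.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Pretend these were loaded from a checkpoint as separate tensors.
w_q = [[1, 0], [0, 1]]
w_k = [[2, 0], [0, 2]]
w_v = [[0, 1], [1, 0]]

# "Converter" step: fuse along the output dimension (column-wise concat).
w_qkv = [rq + rk + rv for rq, rk, rv in zip(w_q, w_k, w_v)]

x = [[3, 4]]  # one input row

# One fused projection...
fused = matmul(x, w_qkv)
# ...equals the three separate projections, concatenated.
separate = [matmul(x, w_q)[0] + matmul(x, w_k)[0] + matmul(x, w_v)[0]]
assert fused == separate
print(fused)  # [[3, 4, 6, 8, 4, 3]]
```

The fused layout is what lets a more compute-dense kernel do one large GEMM instead of three small ones; the real loader does this with safetensors shards and parallel threads rather than Python lists.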
Personally, I'm honored to have contributed in this direction, including the work on optimizing MoE implementations and making modeling code more torch-exportable, so these optimizations can be ported cleanly across runtimes.
Overall, Transformers v5 is a strong signal of where the community and industry are converging: Modularity and Performance, without sacrificing Flexibility.
Transformers v5 makes its signature from_pretrained an entrypoint where you can mix and match:
- Parallelism
- Quantization
- Custom kernels
- Flash/Paged attention
- Continuous batching
- ...
Kudos to everyone involved! I highly recommend the:
Release notes: https://github.com/huggingface/transformers/releases/tag/v5.0.0
Blog post: https://huggingface.co/blog/transformers-v5
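As a rough sketch of what that mixing and matching looks like at the from_pretrained entrypoint (the model id and the exact kwarg combination are illustrative, and the snippet downloads weights, so treat it as a reference fragment rather than something to run as-is):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative; any causal LM on the Hub

# One entry point, several orthogonal options mixed together:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,        # precision
    device_map="auto",           # placement across available devices
    attn_implementation="sdpa",  # attention backend; e.g. "flash_attention_2"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```

Quantization plugs into the same call via a quantization_config argument, which is what makes from_pretrained the single place where these options compose.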
replied to their post about 1 month ago
posted an update about 1 month ago
Post
2381
After 2 months of refinement, I'm happy to announce that a lot of Transformers' modeling code is now significantly more torch-compile & export-friendly!
Why it had to be done:
PyTorch's Dynamo compiler is increasingly becoming the default interoperability layer for ML systems. Anything that relies on torch.export or torch.compile, from model optimization to cross-framework integrations, benefits directly when models can be captured as a single dynamo-traced graph!
Transformers models are now easier to:
- Compile end-to-end with torch.compile backends
- Export reliably via torch.export and torch.onnx.export
- Deploy to ONNX / ONNX Runtime, Intel's OpenVINO, NVIDIA AutoDeploy (TRT-LLM), AMD's Quark, Meta's ExecuTorch, and more hardware-specific runtimes.
This work aims at unblocking entire TorchDynamo-based toolchains that rely on exporting Transformers across runtimes and accelerators.
We are doubling down on Transformers' commitment to be a first-class citizen of the PyTorch ecosystem: more exportable, more optimizable, and easier to deploy everywhere.
There are definitely some edge cases that we still haven't addressed, so don't hesitate to try compiling/exporting your favorite transformers and to open issues/PRs.
PR in the comments! More updates coming soon!
reacted to tsungyi's post 5 months ago
Post
3732
We're excited to share that Cosmos Reason has surpassed 1 million downloads on Hugging Face!
Cosmos Reason is an open, customizable, commercial-ready 7B-parameter reasoning vision language model (VLM) designed for physical AI. By combining physics understanding, prior knowledge, and common sense reasoning, Cosmos Reason empowers AI agents and robots to operate intelligently in real-world environments.
Key applications already unlocked include:
- Automating large-scale dataset curation and annotation
- Powering robot planning and vision-language action (VLA) decision-making
- Driving advanced video analytics and actionable insight generation
We're proud to see a global community of developers using Cosmos Reason to teach robots to think like humans, and we're just getting started.
Get started with Cosmos Reason 1 NIM, an easy-to-use microservice for AI model deployment: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos-reason1-7b?version=1
See the leaderboard: facebook/physical_reasoning_leaderboard
posted an update 7 months ago
Post
3513
Optimum: The Last v1 Release
Optimum v1.27 marks the final major release in the v1 series. As we close this chapter, we're laying the groundwork for a more modular and community-driven future:
- Optimum v2: A lightweight core package for porting Transformers, Diffusers, or Sentence-Transformers to specialized AI hardware/software/accelerators.
- Optimum-ONNX: A dedicated package where the ONNX/ONNX Runtime ecosystem lives and evolves, faster-moving and decoupled from the Optimum core.
Why this matters:
- A clearer governance path for ONNX, fostering stronger community collaboration and improved developer experience.
- Faster innovation in a more modular, open-source environment.
What this means:
- More transparency, broader participation, and faster development driven by the community and key actors in the ONNX ecosystem (PyTorch, Microsoft, Joshua Lochner, ...)
- A cleaner, more maintainable core Optimum, focused on extending HF libraries to specialized AI hardware/software/accelerator tooling, used by our partners (Intel, Amazon Web Services (AWS), AMD, NVIDIA, FuriosaAI, ...)
Major updates I worked on in this release:
- Added support for Transformers v4.53 and SmolLM3 in ONNX/ONNX Runtime.
- Solved batched inference/generation for all supported decoder model architectures (LLMs).
Big shoutout to @echarlaix for leading the refactoring work that cleanly separated ONNX exporter logic and enabled the creation of Optimum-ONNX.
Release Notes: https://lnkd.in/gXtE_qji
Optimum: https://lnkd.in/ecAezNT6
Optimum-ONNX: https://lnkd.in/gzjyAjSi
#Optimum #ONNX #OpenSource #HuggingFace #Transformers #Diffusers
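For context, a typical Optimum export-and-run path (including the batched generation mentioned above) looks roughly like this; the model id is illustrative, and the snippet downloads and converts weights, so treat it as a reference fragment rather than something to run offline:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # illustrative; any supported decoder LM

# export=True converts the PyTorch checkpoint to ONNX on the fly,
# then serves it through ONNX Runtime.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Batched generation across a list of prompts.
inputs = tokenizer(
    ["Hello!", "Batched inference works too."],
    return_tensors="pt",
    padding=True,
)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```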
reacted to clem's post over 1 year ago
reacted to merve's post over 1 year ago
Post
6142
Fine-tune Florence-2 on any task!
Today we release a notebook and a walkthrough blog on fine-tuning Florence-2 on the DocVQA dataset @andito @SkalskiP
Blog: https://huggingface.co/blog
Notebook: https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing
Florence-2 is a great vision-language model thanks to its massive dataset and small size!
This model requires conditioning through task prefixes, and it's not as generalist: it needs fine-tuning on a new task, such as DocVQA.
We fine-tuned the model on an A100 (a smaller GPU with a smaller batch size also works) and saw that the model picks up new tasks.
See below how it looks before and after fine-tuning.
Play with the demo here: andito/Florence-2-DocVQA
posted an update over 1 year ago
Post
4167
Last week, Intel's new Xeon CPUs, Sapphire Rapids (SPR), landed on Inference Endpoints, and I think they have the potential to reduce the cost of your RAG pipelines.
Why? Because they come with Intel® AMX support, a set of instructions that accelerate BF16 and INT8 matrix multiplications on CPU.
I went ahead and built a Space to showcase how to efficiently deploy embedding models on SPR for both Retrieving and Ranking documents, with Haystack compatible components: https://huggingface.co/spaces/optimum-intel/haystack-e2e
Here's how it works:
- Document Store: A FAISS document store containing the seven-wonders dataset, embedded, indexed and stored on the Space's persistent storage to avoid unnecessary re-computation of embeddings.
- Retriever: It embeds the query at runtime and retrieves from the dataset N documents that are most semantically similar to the query's embedding.
We use the small variant of the BGE family here because we want a model that's fast to run on the entire dataset and has a small embedding space for fast similarity search. Specifically we use an INT8 quantized bge-small-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.
- Ranker: It re-embeds the retrieved documents at runtime and re-ranks them based on semantic similarity to the query's embedding. We use the large variant of the BGE family here because it's optimized for accuracy, allowing us to filter the most relevant k documents that we'll use in the LLM prompt. Specifically, we use an INT8 quantized bge-large-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.
Space: https://huggingface.co/spaces/optimum-intel/haystack-e2e
Retriever IE: optimum-intel/fastrag-retriever
Ranker IE: optimum-intel/fastrag-ranker
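The retrieve-then-rerank pattern itself is easy to sketch. Below is a dependency-free toy illustration with made-up 3-dimensional "embeddings" standing in for the BGE models (in the real pipeline the retriever and ranker use different models, so the second pass genuinely re-scores the candidates):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy document store: (text, embedding) pairs. In the real pipeline the
# embeddings come from bge-small (retriever) and bge-large (ranker).
docs = [
    ("The Great Pyramid of Giza is the oldest wonder.", [0.9, 0.1, 0.0]),
    ("The Hanging Gardens may never have existed.",     [0.2, 0.8, 0.1]),
    ("The Colossus of Rhodes was a bronze statue.",     [0.1, 0.2, 0.9]),
]

def retrieve(query_emb, store, n):
    """Cheap pass: top-n documents by similarity (small, fast model)."""
    return sorted(store, key=lambda d: cosine(query_emb, d[1]), reverse=True)[:n]

def rerank(query_emb, candidates, k):
    """Expensive pass: re-score only the candidates (large model), keep top-k."""
    return sorted(candidates, key=lambda d: cosine(query_emb, d[1]), reverse=True)[:k]

query_emb = [0.85, 0.15, 0.05]  # pretend this came from embedding the query
candidates = retrieve(query_emb, docs, n=2)
top = rerank(query_emb, candidates, k=1)
print(top[0][0])  # prints the pyramid document
```

The cost structure is the point: the small model touches the whole store, while the large model only touches the n retrieved candidates, which is what makes the accurate reranker affordable.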
reacted to Molbap's post almost 2 years ago
Post
5528
Exciting times for the document AI community!
We're thrilled to announce the release of some of the largest OCR datasets available to the public.
With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.
Here's how to access these datasets quickly:
```python
from datasets import load_dataset

pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)
```
This enables you to stream them directly, integrating seamlessly with your projects using the Hugging Face datasets library. On the hub, you can find them here:
pixparse/pdfa-eng-wds
pixparse/idl-wds
For lean data loading, the new [chug](https://github.com/huggingface/chug) library offers a solution with pdf decoding:
```python
import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))
```
We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable.
Looking Ahead:
We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.
For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.
This is so cool, and it's the kind of AI many industries need!
reacted to akhaliq's post about 2 years ago
Post
Here is my selection of papers for today (9 Jan)
https://huggingface.co/papers
AGG: Amortized Generative 3D Gaussians for Single Image to 3D
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
TeleChat Technical Report
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Mixtral of Experts