
Ali El Filali PRO

alielfilali01

AI & ML interests

AI Psychometrician? | NLP (mainly for Arabic) | Interests include Reinforcement Learning and Cognitive Sciences, among others

Recent Activity

updated a dataset about 14 hours ago
OALL/requests_v2
liked a model about 19 hours ago
nvidia/GR00T-N1-2B

Organizations

Gradio-Themes-Party · Arabic Machine Learning · BigLAM: BigScience Libraries, Archives and Museums · Stable Diffusion Dreambooth Concepts Library · Blog-explorers · ASAS AI · Nt3awnou · Qwen · Mixed Arabic Datasets · ZeroGPU Explorers · 2A2I Legacy Models & Datasets · AtlasIA · 2A2I · MLX Community · Open Arabic LLM Leaderboard · Social Post Explorers · Cohere Labs Community · Dev Mode Explorers · Chinese LLMs on Hugging Face · ThinkAI · KABOUR · Hugging Face Discord Community · llmc · Arabic Translation Prompt Engineering · Inception · Dataset Tools · ml-fw-prerelease · Data Is Better Together Contributor · Donut Earthers 🍩 · QudraTech · 3C3H · Conception · Inception & MBZUAI VLM Eval Team

alielfilali01's activity

reacted to BramVanroy's post with ❤️ 8 days ago

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only documents that are Creative Commons-licensed (such as cc-by-4.0) or in the public domain (cc0). At this stage, 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
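The kind of license scan described above can be sketched with nothing but the standard library. This is a minimal illustration, not the actual C5 pipeline (which has its own codebase, linked above); the class name and the two-field output are my own:

```python
from html.parser import HTMLParser

CC_PATTERN = "creativecommons.org/licenses/"

class CCLicenseFinder(HTMLParser):
    """Collect links to Creative Commons licenses, noting whether each
    was found inside <head> (metadata) or in the page body."""
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.hits = []  # (url, found_in_head)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        attrs = dict(attrs)
        # <a href=...> in the body; <link href=...> / <meta content=...> in the head
        for key in ("href", "content"):
            url = attrs.get(key, "")
            if CC_PATTERN in url:
                self.hits.append((url, self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

page = """<html><head>
<link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
</head><body>
<a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA</a>
</body></html>"""

finder = CCLicenseFinder()
finder.feed(page)
print(finder.hits)
# [('https://creativecommons.org/licenses/by/4.0/', True),
#  ('https://creativecommons.org/licenses/by-sa/4.0/', False)]
```

Recording *where* each license link was found is what makes the later "was the license in the head?" filtering field cheap to compute.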

๐ŸŒ In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frysian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was not filtered on quality nor deduplicated so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field is present to describe whether a document is included in the FineWeb(-2) datasets, which are of high quality.

๐Ÿ” More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and which to collaborate in an open and transparent manner, please get in touch!
  • 1 reply
ยท
reacted to lukmanaj's post with 👍🤗 12 days ago
I'm excited to share that I've completed the Hugging Face Agents Course and earned my certificate.

Over the past few months, I explored how to build intelligent, autonomous agents using cutting-edge tools like smolagents, LlamaIndex, and LangGraph. The course covered everything from the fundamentals of agents to advanced topics like fine-tuning for function-calling, observability, evaluation, and even agents in games.

Some key content included:

1. Introduction to AI Agents

2. Agentic RAG use cases

3. Multi-framework implementation: smolagents, LlamaIndex, and LangGraph

4. Building, testing, and certifying a complete agent project

This was a hands-on, practical experience that deepened my understanding of how to design reliable, tool-using LLM agents. Looking forward to leveraging these skills in real-world applications in healthcare, logistics, and beyond.
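At its core, the tool-using agent pattern those frameworks (smolagents, LlamaIndex, LangGraph) automate is a small dispatch loop: the model emits an action, the runtime executes the matching tool and feeds the observation back until a final answer appears. A framework-free sketch, with the action list standing in for real model outputs:

```python
# Minimal sketch of a tool-calling agent loop; tool names and the
# (tool, argument) action format are illustrative, not any library's API.
def calculator(expression: str) -> str:
    # Evaluate simple arithmetic with builtins disabled.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def run_agent(actions):
    """`actions` stands in for model outputs: (tool_name, argument) pairs,
    ending with a ("final_answer", text) action."""
    observations = []
    for tool_name, argument in actions:
        if tool_name == "final_answer":
            return argument, observations
        observations.append(TOOLS[tool_name](argument))
    raise RuntimeError("agent never produced a final answer")

answer, obs = run_agent([("calculator", "21 * 2"),
                         ("final_answer", "The result is 42")])
print(answer, obs)  # The result is 42 ['42']
```

In a real agent the action list is produced turn by turn by an LLM conditioned on the accumulated observations; the loop itself stays this simple.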

Many thanks to the Hugging Face team for putting this together.
Let's build safe and useful agents!

posted an update 13 days ago
reacted to anakin87's post with 👍 13 days ago

I trained a Language Model to schedule events with GRPO! 👑 🗓️

โœ๏ธ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

I experimented with GRPO lately.

I am fascinated by models learning from prompts and rewards: no example answers needed, unlike in Supervised Fine-Tuning.

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...

I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.

Choosing an original problem forced me to:
🤔 Think about the problem setting
🧬 Generate data
🤏 Choose the right base model
🏆 Design reward functions (and experience reward hacking)
🔄 Run multiple rounds of training, hoping that my model would learn something.
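To make the reward-design step concrete, here is one possible reward for the scheduling task, a sketch of my own and not the function from the blog post: score a proposed schedule by the total priority of the events it keeps, and zero it out if any two kept events overlap. Making invalid schedules worthless closes one easy reward-hacking route (emitting overlapping high-priority events):

```python
# Hypothetical reward function for GRPO on the scheduling task.
def schedule_reward(schedule, priorities):
    """schedule: list of (start, end, name); priorities: name -> weight."""
    events = sorted(schedule)  # sort by start time
    for (s1, e1, _), (s2, e2, _) in zip(events, events[1:]):
        if s2 < e1:            # overlap -> invalid schedule, no reward
            return 0.0
    return sum(priorities[name] for _, _, name in events)

prios = {"standup": 1.0, "deep work": 3.0, "review": 2.0}
good = [(9, 10, "standup"), (10, 12, "deep work"), (13, 14, "review")]
bad  = [(9, 11, "standup"), (10, 12, "deep work")]
print(schedule_reward(good, prios), schedule_reward(bad, prios))  # 6.0 0.0
```

GRPO then compares several sampled schedules per prompt and pushes the policy toward the higher-reward ones.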

A fun and rewarding 😄 experience.


I learned a lot that I want to share with you. 👇
✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
💻 Code: https://github.com/anakin87/qwen-scheduler-grpo
🤗 Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837
reacted to ImranzamanML's post with 👍🧠 16 days ago

🚀 New paper out: "Improving Arabic Multi-Label Emotion Classification using Stacked Embeddings and Hybrid Loss Function"
Improving Arabic Multi-Label Emotion Classification using Stacked Embeddings and Hybrid Loss Function (2410.03979)

In this work, we tackle some major challenges in Arabic multi-label emotion classification, especially the issues of class imbalance and label correlation that often hurt model performance, particularly for minority emotions.

Our approach:

Stacked contextual embeddings from fine-tuned ArabicBERT, MarBERT, and AraBERT models.

A meta-learning strategy that builds richer representations.

A hybrid loss function combining class weighting, label correlation matrices, and contrastive learning to better handle class imbalances.

🧠 Model pipeline: stacked embeddings → meta-learner → Bi-LSTM → fully connected network → multi-label classification.
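To give a feel for the hybrid-loss idea, here is a generic sketch combining class-weighted binary cross-entropy with a label-correlation penalty. The paper's exact formulation (including its contrastive term) differs; all names and the `alpha` weighting here are my own:

```python
import numpy as np

def hybrid_loss(y_true, y_prob, class_weights, corr, alpha=0.1, eps=1e-7):
    """Class-weighted BCE plus a penalty on disagreeing predictions
    for correlated label pairs. Shapes: y_true/y_prob (batch, labels)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # Per-label weighted binary cross-entropy (rarer emotions weigh more).
    bce = -(class_weights * (y_true * np.log(y_prob)
                             + (1 - y_true) * np.log(1 - y_prob))).mean()
    # Penalize prediction gaps between strongly correlated label pairs.
    diff = y_prob[:, :, None] - y_prob[:, None, :]
    corr_pen = (corr * diff ** 2).mean()
    return bce + alpha * corr_pen

y_true = np.array([[1, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7]])
w = np.array([1.0, 2.0, 1.5])       # up-weight the rarer middle label
corr = np.eye(3)                    # toy correlation matrix
loss = hybrid_loss(y_true, y_prob, w, corr)
print(round(float(loss), 4))  # 0.3622
```

With an identity correlation matrix the penalty vanishes, so only the weighted BCE contributes here; a real correlation matrix would couple related emotions like joy and optimism.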

๐Ÿ” Extensive experiments show significant improvements across Precision, Recall, F1-Score, Jaccard Accuracy, and Hamming Loss.
๐ŸŒŸ The hybrid loss function in particular helped close the gap between majority and minority classes!

We also performed ablation studies to break down each component's contribution, and the results consistently validated our design choices.

This framework isn't just for Arabic; it offers a generalizable path for improving multi-label emotion classification in other low-resource languages and domains.

Big thanks to my co-authors: Muhammad Azeem Aslam, Wang Jun, Nisar Ahmed, Li Yanan, Hu Hongfei, Wang Shiyu, and Xin Liu!

Would love to hear your thoughts on this work! 👇
reacted to shekkizh's post with ❤️ 20 days ago
Think AGI is just around the corner? Not so fast.

When OpenAI released its Computer-Using Agent (CUA) API, I happened to be playing Wordle 🧩 and thought: why not see how the model handles it?
Spoiler: Wordle turned out to be a surprisingly effective benchmark.
So Romain Cosentino, Ph.D., and I dug in and analyzed the results of several hundred runs.

🔑 Takeaways
1️⃣ Even the best computer-using models struggle with simple, context-dependent tasks.
2️⃣ Visual perception and reasoning remain major hurdles for multimodal agents.
3️⃣ Real-world use cases reveal significant gaps between hype and reality. Perception accuracy drops to near zero by the last turn 📉

🔗 Read our arXiv article for more details: https://www.arxiv.org/abs/2504.15434
reacted to clem's post with 🤗 2 months ago
We just crossed 1,500,000 public models on Hugging Face (and 500k spaces, 330k datasets, 50k papers). One new repository is created every 15 seconds. Congratulations all!
reacted to BrigitteTousi's post with 🚀 2 months ago
reacted to MohamedRashad's post with 🚀❤️ 3 months ago
posted an update 3 months ago
🚨 Arabic LLM Evaluation 🚨

A few models joined the ranking of the AraGen Leaderboard (https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard) today.

The new MistralAI model, Saba, is quite impressive: top 10! Well done @arthurmensch and team.

Sadly, Mistral did not follow its public-weights strategy this time; we hope this changes soon and we get the model with a permissive license.

We added other Mistral models, and apparently we have been sleeping on mistralai/Mistral-Large-Instruct-2411!

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here, and it is IMPRESSIVE given its size!

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.
reacted to merve's post with 🚀🧠 3 months ago

Google just released PaliGemma 2 Mix: new versatile instruction vision language models 🔥

> Three new models: 3B, 10B, 28B, at resolutions 224 and 448 💙
> Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯

Read more https://huggingface.co/blog/paligemma2mix
Try the demo google/paligemma2-10b-mix
All models are here google/paligemma-2-mix-67ac6a251aaf3ee73679dcc4
reacted to dreamerdeo's post with 🤗🚀 3 months ago

🚀 Excited to share our technical report on the Southeast Asian multilingual model Sailor2 and its latest updates!

Our 49-page report details Sailor2's development journey, including multilingual data cleaning, small-model data-mixture simulations, multi-stage continual pre-training, multi-stage post-training, and multi-cultural, multi-lingual evaluations. Sailor2 aims to streamline multilingual model pre-training for the community.

🧭 We highlight Sailor2's impressive performance in low-resource language translation scenarios and its cultural understanding advantages in Southeast Asia, promoting practical applications for regional languages.

Model updates include:
💡 More precise outputs: reduced redundancy in model outputs through refined post-training data and optimization techniques.
🌈 Handling longer texts: expanded to handle up to 128K context length in Southeast Asian languages through long-text training.
⚡️ Faster inference: achieved 2.5x faster inference speed with speculative decoding.
🌪️ More model sizes: introduced new sizes of 3B and 14B through model pruning.

🌟 All models are Apache-licensed for commercial use; development tools (code, resources) are open-source.

📚 Technical report: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs (2502.12982)
🤖 Models: sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b
💬 Demo: sail/Sailor2-20B-Chat
📣 Sailor2 community: sailor2
reacted to fantos's post with 🔥 4 months ago

🚀 HuggingFace Spaces Ranking Tracker - Your Complete AI Trend Analytics!

Introducing the Spaces Ranking Tracker, a comprehensive analytics dashboard that tracks and analyzes every AI application in the HuggingFace ecosystem.

✨ Key Features:
• Real-time tracking of daily ranking changes over 30 days
• Detailed analysis of the top 100 trending spaces
• User-based integrated score visualization
• One-click access to space details
• Interactive rank-change graphs

📊 Dashboard Components:
1. Main Dashboard
- Daily rank trend graphs
- Top 20 creators' combined score chart
- Detailed space information cards
- Real-time trending score updates

2. Space Detailed Analysis
- Creation date, current rank, and trending score
- 30-day ranking history
- Direct space access
- Custom color coding for intuitive rank display

🎯 How to Use:
• Monitor the latest AI community trends
• Track your project's performance
• Discover popular AI demos
• Analyze competing projects
• Follow AI ecosystem dynamics

3. Interactive Features
- Custom filtering options
- Sorting by various metrics
- Detailed performance statistics
- Comprehensive trending scores
- Historical data tracking
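The core bookkeeping behind a tracker like this is diffing daily ranking snapshots. A minimal sketch with made-up space names (nothing here reflects the dashboard's actual code or data):

```python
def rank_changes(yesterday, today):
    """Map each space in today's snapshot to its rank delta.
    Positive = climbed, negative = fell, None = new entry today."""
    return {space: (yesterday[space] - rank) if space in yesterday else None
            for space, rank in today.items()}

yesterday = {"space-a": 3, "space-b": 1, "space-c": 2}
today     = {"space-a": 1, "space-b": 2, "space-d": 3}
print(rank_changes(yesterday, today))
# {'space-a': 2, 'space-b': -1, 'space-d': None}
```

Stacking 30 such snapshots gives the per-space rank histories the trend graphs are drawn from.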

Stay on top of every movement in the HuggingFace ecosystem with daily ranking updates! 👉 Try it now!

🔗 Access Dashboard: fantos/Ranking-Tracker
#HuggingFace #AI #DataVisualization #TrendAnalysis #AITrends
reacted to burtenshaw's post with 🚀 4 months ago

A manic few days in open-source AI, with game-changing developments all over the place. Here's a round-up of the resources:

- The science team at @huggingface reproduced and open-sourced DeepSeek R1: https://github.com/huggingface/open-r1
- @qwen released a series of models with a 1 million token context! https://qwenlm.github.io/blog/qwen2.5-1m/
- SmolVLM got even smaller, with completely new variants at 256M and 500M parameters: https://huggingface.co/blog/smolervlm

There's so much you could do with these developments, especially combining them into agentic applications or fine-tuning them on your use case.
reacted to AdinaY's post with 🔥 4 months ago

BIG release by DeepSeek AI 🔥🔥🔥

DeepSeek-R1 & DeepSeek-R1-Zero: two 660B reasoning models are here, alongside 6 distilled dense models (based on Llama & Qwen) for the community!
deepseek-ai
deepseek-ai/DeepSeek-R1

✨ MIT License: enabling distillation for custom models
✨ 32B & 70B models match OpenAI o1-mini across multiple capabilities
✨ API live now! Access Chain-of-Thought reasoning with model='deepseek-reasoner'