In a Training Loop 🔄

Stefano Fiorucci PRO

anakin87

AI & ML interests

Language Models: orchestration, post-training, GRPO, synthetic data... Contributing to the Haystack LLM framework 🏗️

Recent Activity

liked a model about 6 hours ago

VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M

updated a model 1 day ago

anakin87/LFM2-2.6B-mr-tictactoe

updated a dataset 1 day ago

anakin87/tictactoe-filtered

Posts 23

Post

387

💭 Do thinking traces make Language Models learn better? Curious what others think.

𝗦𝗰𝗲𝗻𝗮𝗿𝗶𝗼
You take an instruction-following LM.
You want to train it with a GRPO-style RL algorithm on a task like Tic Tac Toe.
Rewards are outcome-based, applied only at the end of each episode: win/loss/draw, format adherence...

During training, the model could just output answers, but a common choice is to make it also output thinking traces.
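As a rough illustration of the setup above, here is a minimal sketch of an outcome-based reward that is applied once per episode and adds a small bonus for format adherence. The tag names (`<think>`, `<move>`), the reward values, and the function names are all assumptions made for this example; the post does not specify them.

```python
import re

# Hypothetical reward values; the post does not state the actual numbers.
OUTCOME_REWARD = {"win": 1.0, "draw": 0.5, "loss": 0.0}
FORMAT_BONUS = 0.2

# Assumed output format: an optional <think>...</think> block,
# followed by a single <move>N</move> with N in 0..8 (a board cell).
PATTERN = re.compile(
    r"^(?:<think>(.*?)</think>\s*)?<move>([0-8])</move>\s*$",
    re.DOTALL,
)


def format_ok(completion: str, require_thinking: bool) -> bool:
    """Check whether the completion follows the assumed output format."""
    match = PATTERN.match(completion.strip())
    if match is None:
        return False
    # When thinking is required, the trace must be present and non-empty.
    if require_thinking and not (match.group(1) or "").strip():
        return False
    return True


def episode_reward(completion: str, outcome: str, require_thinking: bool = True) -> float:
    """Reward computed only at the end of the episode: game outcome plus format bonus."""
    reward = OUTCOME_REWARD[outcome]
    if format_ok(completion, require_thinking):
        reward += FORMAT_BONUS
    return reward
```

Setting `require_thinking=False` corresponds to the "just output answers" variant, while `require_thinking=True` only grants the format bonus when a non-empty thinking trace precedes the move.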

𝗧𝗵𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻
Does forcing the model to produce thinking traces during training actually improve learning❓

💬 I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.

From what I've understood so far, the answer seems to be 𝘆𝗲𝘀.

1️⃣ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.

2️⃣ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.

3️⃣ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes.
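Point 3️⃣ can be made concrete with the group-relative advantage used by GRPO-style algorithms: several completions are sampled for the same prompt, and each one is scored relative to the others in its group. The sketch below is a simplified illustration, not a full training step; the function name, the `eps` value, and the use of the population standard deviation are assumptions (implementations differ on these details).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prompt.

    Completions with above-average reward get a positive advantage, so the
    tokens they contain (including the thinking trace) are reinforced;
    below-average completions get a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled completions for the same board position,
# with hypothetical end-of-episode rewards.
rewards = [1.2, 0.2, 0.7, 0.2]
advantages = group_relative_advantages(rewards)
```

Because the advantage is attached to the whole completion, a thinking trace that tends to precede winning moves is pushed up together with the move itself. If every completion in the group receives the same reward, all advantages are zero and that group contributes no learning signal.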

Articles 4

Article

30

Exploring Environments Hub: Your Language Model needs better (open) environments to learn

Collections 4

Spaces 6

Gemma 3 270m IT

Chat with Gemma 3 270m IT

Fact Checking rocks!

Fact checking baseline. Dense retrieval + textual entailment

Phi 3.5 Mini ITA

Chat with an Italian Small Model

Gemma 2 2B Neogenesis ITA

Chat with an Italian Small Model

Gemma 2 9B Neogenesis ITA

9B Italian strong model 💪

Who killed Laura Palmer?

Models 17

anakin87/LFM2-2.6B-mr-tictactoe

Text Generation • 3B • Updated 1 day ago • 261

anakin87/Phi-3.5-mini-ITA

Text Generation • 4B • Updated 13 days ago • 5.39k • 13

anakin87/Qwen3-0.6B-alphabet-sort-grpo

Text Generation • 0.6B • Updated Sep 4, 2025 • 9

anakin87/gemma-2-2b-ita-sft

Text Generation • 3B • Updated Jun 29, 2025

anakin87/electra-italian-xxl-cased-squad-it

Question Answering • 0.1B • Updated Jun 29, 2025 • 15 • 8

anakin87/gemma-2b-orpo

Text Generation • 3B • Updated Jun 29, 2025 • 30 • 28

anakin87/qwen-scheduler-7b-grpo

Text Generation • Updated Apr 26, 2025 • 6

anakin87/gemma-2-9b-neogenesis-ita

Text Generation • 9B • Updated Mar 10, 2025 • 1.3k • 11

anakin87/gemma-2-2b-neogenesis-ita

Text Generation • 3B • Updated Jan 16, 2025 • 1.35k • 6

anakin87/yo-Llama-3-8B-Instruct

Text Generation • 8B • Updated Jul 2, 2024 • 7 • 7

Datasets 11

anakin87/tictactoe-filtered

Viewer • Updated 1 day ago • 174 • 21

anakin87/tictactoe

Viewer • Updated 1 day ago • 200 • 20

anakin87/Qwen3-0.6B-tuned-alphabet-sort-eval

Viewer • Updated Sep 4, 2025 • 15 • 8

anakin87/Qwen3-0.6B-alphabet-sort-eval

Viewer • Updated Sep 4, 2025 • 15 • 20

anakin87/events-scheduling

Viewer • Updated Apr 26, 2025 • 600 • 156 • 2

anakin87/evol-dpo-ita-reranked

Viewer • Updated Jan 14, 2025 • 19.8k • 25 • 5

anakin87/gemma-vs-gemma-preferences

Viewer • Updated Jan 14, 2025 • 24.7k • 7

anakin87/fine-instructions-ita-70k

Viewer • Updated Jan 14, 2025 • 69.9k • 28 • 4

anakin87/FineTome-single-turn-dedup

Viewer • Updated Jan 11, 2025 • 83.3k • 11

anakin87/tulu-3-sft-mixture-with-language

Viewer • Updated Dec 11, 2024 • 939k • 30