Instructions to use fluently/FluentlyQwen3-4B with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use fluently/FluentlyQwen3-4B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="fluently/FluentlyQwen3-4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("fluently/FluentlyQwen3-4B")
model = AutoModelForCausalLM.from_pretrained("fluently/FluentlyQwen3-4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use fluently/FluentlyQwen3-4B with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "fluently/FluentlyQwen3-4B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "fluently/FluentlyQwen3-4B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```bash
docker model run hf.co/fluently/FluentlyQwen3-4B
```
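Because the curl calls above target an OpenAI-compatible API, the server can also be called from Python. The sketch below is illustrative rather than part of the card: it assumes the `openai` package is installed (`pip install openai`) and a server running locally on vLLM's default port 8000 (the SGLang server in the next section works the same way on port 30000).

```python
# Minimal sketch: call a locally running OpenAI-compatible server
# (vLLM on port 8000; for the SGLang example below, use port 30000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point at the local server
    api_key="EMPTY",                      # local servers usually ignore the key
)

response = client.chat.completions.create(
    model="fluently/FluentlyQwen3-4B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```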
- SGLang
How to use fluently/FluentlyQwen3-4B with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "fluently/FluentlyQwen3-4B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "fluently/FluentlyQwen3-4B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "fluently/FluentlyQwen3-4B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "fluently/FluentlyQwen3-4B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Unsloth Studio
How to use fluently/FluentlyQwen3-4B with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for fluently/FluentlyQwen3-4B to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for fluently/FluentlyQwen3-4B to start chatting
```
Using HuggingFace Spaces for Unsloth
No setup required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for fluently/FluentlyQwen3-4B to start chatting.
Load model with FastModel
```bash
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="fluently/FluentlyQwen3-4B",
    max_seq_length=2048,
)
```
- Docker Model Runner
How to use fluently/FluentlyQwen3-4B with Docker Model Runner:
```bash
docker model run hf.co/fluently/FluentlyQwen3-4B
```
FluentlyQwen3 4B
Introducing a new LLM from Project Fluently. The goal of this model is to improve the base model by training it on diverse datasets. The model was obtained through SFT and GRPO training and step-by-step merging.
Model details
- Developed by: @fluently
- Model type: Causal Language Models (Qwen3ForCausalLM, LM Transformer)
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 tokens natively and 131,072 tokens with YaRN (see the sketch after this list)
- License: Apache-2.0
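The card itself does not show a long-context configuration, but for Qwen3-family models the usual route to the 131,072-token context is YaRN rope scaling. The sketch below is an assumption based on that upstream convention (the `rope_scaling` override and factor 4.0 are not confirmed by this card), so check the Qwen3 documentation before relying on it.

```python
# Hypothetical sketch: enable ~131k context via YaRN rope scaling,
# following the usual Qwen3 convention (factor 4.0 over 32,768 native tokens).
# The exact override is an assumption, not taken from this model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "fluently/FluentlyQwen3-4B",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
tokenizer = AutoTokenizer.from_pretrained("fluently/FluentlyQwen3-4B")
```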
Recipe
*The recipe is approximate; there are some inaccuracies.
Strengths
General improvements
| Task | Result |
|---|---|
| Basic Communication | Improved |
| Translation | Improved |
| Mathematics | Improved |
| Physics | Improved |
| Biology | Improved |
| Medicine | Improved |
| Coding | Improved |
| Agent Functions | Improved |
Quickstart
The following code snippet shows how to load the tokenizer and the model, build the prompt with apply_chat_template, and generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "fluently/FluentlyQwen3-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
Switching Between Thinking and Non-Thinking Mode
The `enable_thinking` switch is also available in APIs created by SGLang and vLLM.
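As a hedged illustration of that server-side switch: with an OpenAI-compatible vLLM or SGLang endpoint, the flag is commonly passed through the chat template via `chat_template_kwargs` in the request body. That field name follows the upstream Qwen3 documentation rather than this card, so verify it against your server version.

```python
# Sketch: toggling enable_thinking through an OpenAI-compatible server
# (vLLM or SGLang). `chat_template_kwargs` follows the upstream Qwen3 docs
# and is an assumption here, not something this card documents.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="fluently/FluentlyQwen3-4B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```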
enable_thinking=True
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting enable_thinking=True or leaving it as the default value in tokenizer.apply_chat_template, the model will engage its thinking mode.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```
In this mode, the model will generate think content wrapped in a <think>...</think> block, followed by the final response.
For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
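A small sketch of how those settings map onto `model.generate` in Transformers, reusing `model` and `model_inputs` from the Quickstart above (the keyword names are standard `transformers` sampling arguments; `min_p` requires a reasonably recent release):

```python
# Thinking-mode sampling settings from the card, passed to generate();
# do_sample=True avoids the greedy decoding the card warns against.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```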
enable_thinking=False
We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```
In this mode, the model will not generate any think content and will not include a <think>...</think> block.
For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
Special thanks
🤗 We are grateful for open source resources, technologies and assistance from:
Model tree for fluently/FluentlyQwen3-4B
- Base model: Qwen/Qwen3-4B-Base