---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---
# Qwen3-72B-Synthesis
This still doesn't work; I'm trying to fix it.
A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.
## Model Description
**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It was created with a novel goal: a model with the pure, modern **Qwen3 architecture** that inherits the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model.
This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning.
The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."
## Model Details
* **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
* **Parameters:** ~72 Billion
* **Layers:** 80
* **Foundation:** `Qwen/Qwen3-32B`
* **Donor:** `Qwen/Qwen2.5-72B-Instruct`
* **Tokenizer:** `Qwen/Qwen3-32B` Tokenizer (`vocab_size: 151936`)
## Model Creation Process
The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.
### Phase 1: Foundation Upscaling
First, the `Qwen/Qwen3-32B` model (64 layers, 5120 hidden dim) was up-scaled to match the target 72B dimensions. This was done with a **self-interpolation** script: new dimensions were created by averaging different slices of the existing weights, rather than by simple tiling. The result, `Qwen3-32B-Upscaled`, is a 64-layer model with the correct 72B tensor shapes and the Qwen3 architecture.
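The up-scaling script itself is not reproduced here; the sketch below only illustrates the self-interpolation idea on a single 2-D weight. The function name `self_interpolate` and the specific interpolation scheme (averaging pairs of offset slices) are illustrative assumptions, not the exact code used.

```python
import torch

def self_interpolate(weight: torch.Tensor, new_dim: int, dim: int = 0) -> torch.Tensor:
    """Grow `weight` along `dim` to `new_dim` by averaging pairs of
    neighbouring slices of the existing weight, rather than tiling."""
    old_dim = weight.size(dim)
    extra = new_dim - old_dim
    # Pick `extra` positions spread across the original dimension.
    idx = torch.linspace(0, old_dim - 2, extra).round().long()
    lo = weight.index_select(dim, idx)       # slice at the chosen positions
    hi = weight.index_select(dim, idx + 1)   # slice offset by one
    new_rows = 0.5 * (lo + hi)               # interpolated values
    return torch.cat([weight, new_rows], dim=dim)

# Example: grow a 5120-dim projection to the 8192-dim 72B target shape.
w = torch.randn(5120, 5120)
w_up = self_interpolate(self_interpolate(w, 8192, dim=0), 8192, dim=1)
print(w_up.shape)  # torch.Size([8192, 8192])
```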
### Phase 2: Donor Alignment
The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. This process involved:
1. Creating an empty 80-layer model shell with the pure Qwen3 architecture.
2. Surgically removing all `.bias` tensors from the Qwen2.5 weights.
3. Truncating the Qwen2.5 embedding and language model head layers from a vocabulary of 152064 to match Qwen3's 151936.
4. Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a fully compatible donor model (sketched below).
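As a rough illustration of steps 2–4, the following sketch operates on an in-memory `state_dict` and assumes the standard Hugging Face tensor names for Qwen models (`model.embed_tokens.weight`, `lm_head.weight`); it is not the exact alignment script used.

```python
import torch

def align_qwen25_to_qwen3(state_dict: dict, target_vocab: int = 151936) -> dict:
    aligned = {}
    for name, tensor in state_dict.items():
        # Step 2: drop every `.bias` tensor -- the Qwen3 architecture has none.
        if name.endswith(".bias"):
            continue
        # Step 3: truncate embedding and LM head rows from 152064 to 151936.
        if name in ("model.embed_tokens.weight", "lm_head.weight"):
            tensor = tensor[:target_vocab, :]
        aligned[name] = tensor
    return aligned

# Step 4 (conceptually): load the aligned weights into the empty Qwen3 shell.
# shell.load_state_dict(aligned, strict=False)  # q_norm / k_norm keep their init
```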
### Phase 3: Knowledge Transplant via MergeKit
With two architecturally-compatible models, the final merge was performed using `MergeKit`. A "Knowledge Bridge" strategy was employed to transplant a stable reasoning core from the donor while blending the rest.
The following `MergeKit` configuration was used:
```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16
slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.5
  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]
  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.5
tokenizer_source: ./Qwen3-32B-Upscaled
```
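With the two prepared models on disk, the merge itself is run through MergeKit's standard `mergekit-yaml` entry point, along the lines of `mergekit-yaml config.yaml ./Qwen3-72B-Synthesis --cuda --lazy-unpickle` (the flags shown are illustrative; consult the MergeKit documentation for the options appropriate to your hardware).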
## How to Use
This model uses the standard Qwen ChatML prompt format.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "cognitivecomputations/Qwen3-72B-Synthesis"
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Intended Use and Limitations
**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.**
* **Fine-tuning is highly recommended.** While it inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock its full potential.
* The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
* This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for the output generated.
## Acknowledgements
This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible `MergeKit` toolkit created by Charles Goddard and Arcee.ai. |