---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---

# Qwen3-72B-Synthesis

This still doesn't work; I'm trying to fix it.

A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.

## Model Description

**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern **Qwen3 architecture** while inheriting the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model.

This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning.

The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."

## Model Details

*   **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
*   **Parameters:** ~72 Billion
*   **Layers:** 80
*   **Foundation:** `Qwen/Qwen3-32B`
*   **Donor:** `Qwen/Qwen2.5-72B-Instruct`
*   **Tokenizer:** `Qwen/Qwen3-32B` Tokenizer (`vocab_size: 151936`)

## Model Creation Process

The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.

### Phase 1: Foundation Upscaling

First, the `Qwen/Qwen3-32B` model (64 layers, 5120 hidden dim) was up-scaled to match the target 72B tensor dimensions. This was done with a **self-interpolation** script, in which new dimensions were created by averaging different slices of the existing weights rather than by simple tiling. The result was `Qwen3-32B-Upscaled`, a 64-layer model with the correct 72B tensor shapes and the Qwen3 architecture.
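The original upscaling script is not reproduced here, but the core idea can be sketched as follows. This is a minimal, hypothetical illustration: the function name, the choice of slices to average, and the assumed target hidden size of 8192 are assumptions, not the exact procedure used.

```python
# Minimal sketch of self-interpolation: grow a weight matrix's first
# dimension by averaging evenly spaced pairs of adjacent rows, rather than
# tiling/repeating existing rows. Illustrative only.
import torch

def widen_rows(weight: torch.Tensor, new_dim: int) -> torch.Tensor:
    old_dim = weight.shape[0]
    extra = new_dim - old_dim
    # Pick `extra` evenly spaced positions and average each row with its
    # neighbour to synthesize new, in-distribution rows.
    idx = torch.linspace(0, old_dim - 2, extra).long()
    new_rows = 0.5 * (weight[idx] + weight[idx + 1])
    return torch.cat([weight, new_rows], dim=0)

# Example: widen a 5120-wide projection toward an assumed 8192-wide 72B shape.
w = torch.randn(5120, 5120, dtype=torch.bfloat16)
print(widen_rows(w, 8192).shape)  # torch.Size([8192, 5120])
```

A full pipeline would apply the same treatment along every tensor axis whose size differs between the 32B and 72B shapes (attention projections, MLP weights, norms).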

### Phase 2: Donor Alignment

The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. The process, sketched in code after this list, involved:
1.  Creating an empty 80-layer model shell with the pure Qwen3 architecture.
2.  Surgically removing all `.bias` tensors from the Qwen2.5 weights.
3.  Truncating the Qwen2.5 embedding and language model head layers from a vocabulary of 152064 to match Qwen3's 151936.
4.  Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a perfectly compatible donor model.
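A minimal sketch of this alignment step is shown below. It is illustrative only: the shell-config path is hypothetical, and a real run would process the safetensors shards incrementally rather than holding two 72B models in memory at once.

```python
# Illustrative sketch of the donor-alignment step; paths and loading
# mechanics are assumptions, not the original script.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

QWEN3_VOCAB = 151936  # Qwen3 tokenizer / embedding size

# 1. Empty 80-layer shell with the pure Qwen3 architecture
#    (hypothetical local config describing the 72B-sized Qwen3 model).
shell_config = AutoConfig.from_pretrained("./qwen3-72b-shell")
shell = AutoModelForCausalLM.from_config(shell_config)

# 2./3. Strip bias tensors and truncate the vocabulary dimension.
donor = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct", torch_dtype=torch.bfloat16
)
aligned = {}
for name, tensor in donor.state_dict().items():
    if name.endswith(".bias"):
        continue  # Qwen3 uses no bias tensors
    if name.endswith("embed_tokens.weight") or name.endswith("lm_head.weight"):
        tensor = tensor[:QWEN3_VOCAB]  # 152064 -> 151936 rows
    aligned[name] = tensor

# 4. Load the modified Qwen2.5 weights into the Qwen3 shell. strict=False
#    because the shell's q_norm / k_norm tensors have no Qwen2.5 counterpart
#    and keep their initialized values.
missing, unexpected = shell.load_state_dict(aligned, strict=False)
print(len(missing), len(unexpected))
shell.save_pretrained("./Qwen2.5-72B-Instruct-Aligned")
```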

### Phase 3: Knowledge Transplant via MergeKit

With the two models now architecturally compatible, the final merge was performed with `MergeKit`. A "Knowledge Bridge" strategy was used: an unmodified block of middle layers is transplanted directly from the donor, while the surrounding layers are blended.

The following `MergeKit` configuration was used:

```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16

slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
    - model: ./Qwen3-32B-Upscaled
      layer_range: [0, 32]
      parameters:
        weight: 0.5
    - model: ./Qwen2.5-72B-Instruct-Aligned
      layer_range: [0, 32]
      parameters:
        weight: 0.5

  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
    - model: ./Qwen2.5-72B-Instruct-Aligned
      layer_range: [32, 48]

  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
    - model: ./Qwen3-32B-Upscaled
      layer_range: [32, 64]
      parameters:
        weight: 0.5
    - model: ./Qwen2.5-72B-Instruct-Aligned
      layer_range: [48, 80]
      parameters:
        weight: 0.5

tokenizer_source: ./Qwen3-32B-Upscaled
```

## How to Use

This model uses the standard Qwen ChatML prompt format.
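For reference, the chat template expands a conversation into the familiar `<|im_start|>` / `<|im_end|>` layout, roughly like this (the `apply_chat_template` call in the example below produces it for you):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain the importance of the LLaMA paper in one paragraph.<|im_end|>
<|im_start|>assistant
```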

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cognitivecomputations/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
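# Strip the prompt tokens so only the newly generated completion is decoded.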
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Intended Use and Limitations

**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.**

*   **Fine-tuning is highly recommended.** While it inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock the model's full potential (a minimal LoRA sketch follows this list).
*   The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
*   This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for the output generated.
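As a purely illustrative starting point for that fine-tuning (not a tested recipe; the hyperparameters, target modules, and trainer choice are assumptions), a parameter-efficient LoRA setup with `peft` might look like this:

```python
# Hypothetical LoRA fine-tuning setup with peft; hyperparameters, target
# modules, and the dataset are placeholders, not a validated recipe.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/Qwen3-72B-Synthesis",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Train with your preferred trainer (e.g. transformers.Trainer or trl's
# SFTTrainer) on a high-quality instruction dataset to harmonize the
# merged layers.
```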

## Acknowledgements

This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible `MergeKit` toolkit created by Charles Goddard and Arcee.ai.