---
library_name: keras-hub
---
## Model Overview

Qwen1.5-MoE is a transformer-based, decoder-only Mixture of Experts (MoE) language model pre-trained on a large amount of data. Its models are upcycled from dense language models: for instance, Qwen1.5-MoE-A2.7B is upcycled from Qwen-1.8B. It has 14.3B parameters in total, of which 2.7B are activated at runtime. It achieves performance comparable to Qwen1.5-7B while requiring only 25% of the training resources.
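To illustrate why only a fraction of an MoE model's parameters is active for any given token, here is a minimal sketch of generic top-k expert routing in plain Python. It is not the KerasHub or Qwen implementation: the function names `route_top_k` and `moe_layer`, the toy experts, and the router scores are all hypothetical, and renormalizing the kept gate weights is an illustrative choice that may not match the released checkpoint. Shared experts are omitted.

```python
import math


def route_top_k(router_logits, top_k):
    """Pick the top_k experts for one token and return normalized gate weights.

    router_logits: list of floats, one score per expert (hypothetical values).
    Returns a dict {expert_index: weight} whose weights sum to 1.
    """
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(logit - peak) for logit in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the top_k most probable experts.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]

    # Renormalize so the kept gate weights sum to 1 (an assumption of this sketch).
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


def moe_layer(token, experts, router_logits, top_k=2):
    """Combine the outputs of the selected experts for a single token."""
    gates = route_top_k(router_logits, top_k)
    # Only the chosen experts are evaluated; the others stay idle for this token.
    return sum(weight * experts[i](token) for i, weight in gates.items())


# Toy example: four scalar "experts" and made-up router scores.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]
gates = route_top_k(router_logits, top_k=2)
output = moe_layer(3.0, experts, router_logits, top_k=2)
```

With `top_k=2`, only two of the four toy experts run for this token. The same idea, at scale, is why the activated parameter count is much smaller than the total.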

Weights are released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).

## Links

* [Qwen 1.5 MoE Quickstart Notebook](https://colab.sandbox.google.com/gist/laxmareddyp/45cb05fbf5d15380297fb8017b181efc/qwenmoe_quickstart.ipynb)
* [Qwen 1.5 MoE API Documentation](https://keras.io/keras_hub/api/models/qwen_moe/)
* [Qwen 1.5 MoE Model Card](https://qwenlm.github.io/blog/qwen-moe/)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
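Keras 3 picks its backend from the `KERAS_BACKEND` environment variable, which is read when `keras` is first imported. The sketch below shows one way to set it from Python; the helper name `select_backend` is hypothetical and only wraps `os.environ` with a small validity check.

```python
import os

# Backend names accepted by this sketch.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def select_backend(name):
    """Set KERAS_BACKEND; call this before the first `import keras`."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    return name


select_backend("jax")
# Import keras / keras_hub only after this point so the setting takes effect.
```

Changing the variable after `keras` has been imported has no effect on the running process, so the call belongs at the top of a script or notebook.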

## Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name         | Parameters | Description                                                                                |
|---------------------|------------|--------------------------------------------------------------------------------------------|
| qwen1.5_moe_2.7b_en | 2.7B       | 24-layer Qwen MoE model with 2.7 billion activated parameters and 8 experts per MoE layer. |

## Example Usage
```Python

import keras
import keras_hub
import numpy as np

# Basic text generation with Qwen MoE
qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset("qwen1.5_moe_2.7b_en")
qwen_moe.generate("I want to say", max_length=30)

# Batch generation with multiple prompts
qwen_moe.generate(["This is a", "Where are you"], max_length=30)

# Using different sampling strategies
qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset("qwen1.5_moe_2.7b_en")
# Greedy sampling
qwen_moe.compile(sampler="greedy")
qwen_moe.generate("I want to say", max_length=30)
# Beam search
# BeamSampler takes no MoE-specific arguments; expert routing is configured
# on the model itself.
qwen_moe.compile(
    sampler=keras_hub.samplers.BeamSampler(num_beams=2)
)
qwen_moe.generate("I want to say", max_length=30)

# Generate without preprocessing
prompt = {
    "token_ids": np.array([[15191, 374, 0, 0, 0]] * 2),
    "padding_mask": np.array([[1, 1, 0, 0, 0]] * 2),
}

qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset(
    "qwen1.5_moe_2.7b_en",
    preprocessor=None,
)
# generate() takes no MoE-specific arguments; expert routing is configured
# on the model itself.
qwen_moe.generate(prompt)

# Training on a single batch
features = ["The quick brown fox jumped.", "I forgot my homework."]
qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset("qwen1.5_moe_2.7b_en")
# fit() takes no MoE-specific arguments; any router auxiliary loss is
# configured on the model itself.
qwen_moe.fit(
    x=features,
    batch_size=2,
)

# Training without preprocessing
x = {
    "token_ids": np.array([[1, 2, 3, 4, 5]] * 2),
    "padding_mask": np.array([[1, 1, 1, 1, 1]] * 2),
}
y = np.array([[2, 3, 4, 5, 0]] * 2)
sw = np.array([[1, 1, 1, 1, 1]] * 2)

qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset(
    "qwen1.5_moe_2.7b_en",
    preprocessor=None,
)
qwen_moe.fit(
    x=x,
    y=y,
    sample_weight=sw,
    batch_size=2,
)

```
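The raw-tensor training inputs above follow a next-token layout: `y` is `token_ids` shifted left by one position. As a minimal sketch of how such a batch could be assembled from variable-length token lists, the hypothetical helper `build_causal_lm_batch` below pads each sequence and derives the labels and weights. The padding id of `0` mirrors the example and is an assumption. Unlike the all-ones `sw` in the example, this sketch gives zero weight to positions whose target is padding.

```python
def build_causal_lm_batch(sequences, sequence_length, pad_id=0):
    """Pad token-id lists and derive next-token labels and sample weights.

    Returns (x, y, sample_weight) as plain Python lists; wrap each array in
    np.array(...) before passing it to fit().
    """
    token_ids, padding_mask, labels, weights = [], [], [], []
    for seq in sequences:
        seq = list(seq)[:sequence_length]  # truncate overly long inputs
        n_pad = sequence_length - len(seq)
        ids = seq + [pad_id] * n_pad
        mask = [1] * len(seq) + [0] * n_pad
        token_ids.append(ids)
        padding_mask.append(mask)
        # The label at position i is the token at position i + 1.
        labels.append(ids[1:] + [pad_id])
        # Score a position only when its target is a real (non-padding) token.
        weights.append(mask[1:] + [0])
    x = {"token_ids": token_ids, "padding_mask": padding_mask}
    return x, labels, weights


x, y, sw = build_causal_lm_batch([[1, 2, 3, 4, 5], [7, 8]], sequence_length=5)
```

Masking the padded targets keeps the loss from rewarding predictions of the padding token.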

## Example Usage with Hugging Face URI

```Python

import keras
import keras_hub
import numpy as np

# Basic text generation with Qwen MoE
qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset("hf://keras/qwen1.5_moe_2.7b_en")
qwen_moe.generate("I want to say", max_length=30)

# Batch generation with multiple prompts
qwen_moe.generate(["This is a", "Where are you"], max_length=30)

# Using different sampling strategies
qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset("hf://keras/qwen1.5_moe_2.7b_en")
# Greedy sampling
qwen_moe.compile(sampler="greedy")
qwen_moe.generate("I want to say", max_length=30)
# Beam search
# BeamSampler takes no MoE-specific arguments; expert routing is configured
# on the model itself.
qwen_moe.compile(
    sampler=keras_hub.samplers.BeamSampler(num_beams=2)
)
qwen_moe.generate("I want to say", max_length=30)

# Generate without preprocessing
prompt = {
    "token_ids": np.array([[15191, 374, 0, 0, 0]] * 2),
    "padding_mask": np.array([[1, 1, 0, 0, 0]] * 2),
}

qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset(
    "hf://keras/qwen1.5_moe_2.7b_en",
    preprocessor=None,
)
# generate() takes no MoE-specific arguments; expert routing is configured
# on the model itself.
qwen_moe.generate(prompt)

# Training on a single batch
features = ["The quick brown fox jumped.", "I forgot my homework."]
qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset("hf://keras/qwen1.5_moe_2.7b_en")
# fit() takes no MoE-specific arguments; any router auxiliary loss is
# configured on the model itself.
qwen_moe.fit(
    x=features,
    batch_size=2,
)

# Training without preprocessing
x = {
    "token_ids": np.array([[1, 2, 3, 4, 5]] * 2),
    "padding_mask": np.array([[1, 1, 1, 1, 1]] * 2),
}
y = np.array([[2, 3, 4, 5, 0]] * 2)
sw = np.array([[1, 1, 1, 1, 1]] * 2)

qwen_moe = keras_hub.models.QwenMoeCausalLM.from_preset(
    "hf://keras/qwen1.5_moe_2.7b_en",
    preprocessor=None,
)
qwen_moe.fit(
    x=x,
    y=y,
    sample_weight=sw,
    batch_size=2,
)

```