lightblue/Karasu-Mixtral-8x22B-v0.1

@@ -3,6 +3,14 @@ library_name: transformers
 tags: []
 ---
 # How to use
 We have tested this model on vLLM, and thus recommend running it from the vLLM OpenAI-compatible server, using the following command:
@@ -59,6 +67,7 @@ We will be uploading a 4bit AWQ model soon to make it easier to run this model o
 # Inference examples
 <details>
   <summary>Creative prompts</summary>
@@ -335,6 +344,124 @@ Ces joueurs sont souvent cités comme étant parmi les meilleurs du monde, mais
 </details>
 # Developers

 tags: []
 ---
+# Model overview
+This is a QLoRA finetune of the newly released [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1) base model.
+As the base model has not explicitly been trained to chat, we trained this model on a multilingual chat dataset so that the LLM community can use this model for conversations.
+The model's accuracy is surprisingly high, and it has a decently fast inference speed (roughly 40 tokens/s for a single batch in our tests), so we believe it will be useful to the community.
 # How to use
 We have tested this model on vLLM, and thus recommend running it from the vLLM OpenAI-compatible server, using the following command:
 # Inference examples
+From qualitative testing, the model seems pretty smart, especially in English, and has very good recall of facts. It can still get confused by some logical questions, but it has also passed many of the logical questions we have thrown at it that other open-source LLMs often fail.
 <details>
   <summary>Creative prompts</summary>
 </details>
+# Training dataset
+We trained this model on conversations between human users and GPT-4.
+The training data consists of two datasets:
+* 6,206 conversations from the [openchat/openchat_sharegpt4_dataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset) dataset ([link](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json?download=true))
+* 3,011 conversations that we created. We wanted to increase the representation of non-English prompts in our training dataset, so we sampled initial prompts from [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), stratifying by language. We then prompted gpt-4-0125 with these, and used the results as training data.
+We plan to release more information on this second dataset soon, as we are also using it in another dataset.
+The complete data used to train this model can be found at [lightblue/gpt4_conversations_multilingual](https://huggingface.co/datasets/lightblue/gpt4_conversations_multilingual).
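
To illustrate what "stratifying by language" means for the second dataset, here is a minimal sketch of per-language sampling. It is not the script we used: the function `stratified_sample`, the `per_language` cap, the `language`/`prompt` field names, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, per_language, seed=0):
    """Draw up to `per_language` prompts from each language group.

    Hypothetical helper: `prompts` is a list of dicts with a "language" key.
    """
    rng = random.Random(seed)

    # Group the rows by their language label.
    by_language = defaultdict(list)
    for row in prompts:
        by_language[row["language"]].append(row)

    # Sample each group separately so that rare languages are not
    # drowned out by the most common one.
    sampled = []
    for language in sorted(by_language):
        rows = by_language[language]
        k = min(per_language, len(rows))
        sampled.extend(rng.sample(rows, k))
    return sampled


# Toy stand-in data (not real lmsys-chat-1m rows): heavily skewed to English.
toy_prompts = (
    [{"language": "English", "prompt": f"en prompt {i}"} for i in range(50)]
    + [{"language": "Japanese", "prompt": f"ja prompt {i}"} for i in range(5)]
    + [{"language": "French", "prompt": f"fr prompt {i}"} for i in range(2)]
)

sample = stratified_sample(toy_prompts, per_language=3)

counts = {}
for row in sample:
    counts[row["language"]] = counts.get(row["language"], 0) + 1
print(counts)  # {'English': 3, 'French': 2, 'Japanese': 3}
```

With a uniform random sample of the same size, most of the toy rows would be English; capping each group is one simple way to raise the share of other languages.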
+# Training details
+We trained this model using Axolotl for roughly 100 minutes in a 4 x A100 (80GB) environment on the Azure cloud (Standard_NC96ads_A100_v4).
+We used DeepSpeed ZeRO-2 to train effectively across the 4 GPUs.
+We used the following config to train the model:
+<details>
+  <summary>Training config</summary>
+```yaml
+base_model: mistral-community/Mixtral-8x22B-v0.1
+model_type: AutoModelForCausalLM
+tokenizer_type: AutoTokenizer
+trust_remote_code: true
+load_in_8bit: false
+load_in_4bit: true
+strict: false
+datasets:
+  - path: lightblue/gpt4_conversations_multilingual
+    type: sharegpt
+    conversation: mistral
+dataset_prepared_path: ./prepared_dataset_2048-multiling
+val_set_size: 0
+output_dir: ./qlora-out-2048-multiling
+## You can optionally freeze the entire model and unfreeze a subset of parameters
+unfrozen_parameters:
+#  - ^lm_head.weight$
+#  - ^model.embed_tokens.weight$[:32000]
+#  - model.layers.2[0-9]+.block_sparse_moe.gate
+#  - model.layers.2[0-9]+.block_sparse_moe.experts
+#  - model.layers.3[0-9]+.block_sparse_moe.gate
+#  - model.layers.3[0-9]+.block_sparse_moe.experts
+model_config:
+  output_router_logits: true
+adapter: qlora
+lora_model_dir:
+sequence_len: 2048
+sample_packing: true
+pad_to_sequence_len: true
+lora_r: 16
+lora_alpha: 16
+lora_dropout: 0.05
+lora_target_linear: true
+lora_fan_in_fan_out:
+#lora_target_modules:
+#  - gate
+#  - q_proj
+#  - k_proj
+#  - v_proj
+#  - o_proj
+#  - w1
+#  - w2
+#  - w3
+gradient_accumulation_steps: 2
+micro_batch_size: 1
+num_epochs: 1
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+use_wandb: true
+wandb_project: wandb_project
+wandb_entity: wandb_entity
+wandb_name: wandb_name
+train_on_inputs: false
+group_by_length: false
+bf16: auto
+fp16:
+tf32: false
+gradient_checkpointing: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+warmup_steps: 10
+evals_per_epoch: 0
+eval_table_size:
+eval_max_new_tokens: 128
+saves_per_epoch: 5
+debug:
+deepspeed: /workspace/axolotl/deepspeed_configs/zero2.json
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens:
+```
+</details>
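
As a rough reading aid for the config above, the sketch below derives the effective batch size and the LoRA scaling factor from the listed values. It assumes plain data parallelism across the 4 GPUs and the standard LoRA scaling of `lora_alpha / lora_r`; the variable names `num_gpus`, `sequences_per_step`, `tokens_per_step`, and `lora_scale` are ours and not Axolotl options.

```python
# Values copied from the training config above.
micro_batch_size = 1
gradient_accumulation_steps = 2
sequence_len = 2048
lora_r = 16
lora_alpha = 16

# From the training details: 4 x A100 (80GB).
num_gpus = 4

# Sequences contributing to one optimizer step, assuming each GPU
# processes its own micro-batches (data parallelism).
sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus

# With sample_packing and pad_to_sequence_len enabled, every sequence is
# sequence_len tokens long, so this is the token count per optimizer step
# (including any padding).
tokens_per_step = sequences_per_step * sequence_len

# Standard LoRA scaling applied to the low-rank update (assumed, see above).
lora_scale = lora_alpha / lora_r

print(sequences_per_step)  # 8
print(tokens_per_step)     # 16384
print(lora_scale)          # 1.0
```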
 # Developers