πŸ€– gama-4b

gama-4b is an efficient 4-billion-parameter language model optimized for multilingual conversation, with a focus on Portuguese and English. It combines specialized capabilities through a strategic merge of complementary models.

πŸ“‹ Overview

This model was developed using the DARE TIES merge technique, which applies DARE (Drop And REscale) sparsification to each model's task vector and resolves parameter conflicts with TIES-style sign election, combining specialized models into a compact and versatile solution for conversational applications in Portuguese and English.
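
As a rough intuition only (a toy sketch, not the mergekit implementation), DARE keeps a random fraction of each fine-tuned model's task vector, controlled by density, and rescales the surviving entries so their expected contribution is unchanged; the rescaled task vectors are then added to the base model with the per-model weight values. The tensor values below are made up for illustration.

import torch

torch.manual_seed(0)

base = torch.zeros(8)                        # base model weights (toy values)
finetuned = torch.randn(8)                   # fine-tuned model weights (toy values)
delta = finetuned - base                     # task vector

density = 0.6                                # fraction of entries kept (matches this merge's config)
keep = (torch.rand_like(delta) < density).float()
rescaled = delta * keep / density            # rescale survivors to preserve expected magnitude

merged = base + 0.34 * rescaled              # one model's weighted contribution
print(merged)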

🌟 Key Features

  • πŸ’¬ Bilingual: Optimized for Brazilian Portuguese and English
  • ⚑ Efficient: Only 4B parameters for fast deployment
  • πŸ”§ Quantized: QAT for better performance/size

πŸ”§ Base Models Used

gama-4b is the result of a strategic merge of the following models:

  β€’ CEIA-UFG/Gemma-3-Gaia-PT-BR-4b-it
  β€’ soob3123/Veiled-Calla-4B
  β€’ soob3123/amoral-gemma3-4B-v2-qat

with unsloth/gemma-3-4b-it-qat as the base model.

πŸ› οΈ Merge Tool

The merge was performed with LazyMergekit, a wrapper around mergekit that simplifies merging language models with advanced configurations.
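
For reference, a merge like this can also be reproduced directly with mergekit's command-line tool by saving the configuration listed under Merge Parameters below to a YAML file; the file name and output directory here are placeholders, not part of the original setup.

pip install -qU mergekit
mergekit-yaml gama-4b-config.yaml ./gama-4b --copy-tokenizer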

βš™οΈ Technical Configuration

Merge Parameters

models:
  - model: CEIA-UFG/Gemma-3-Gaia-PT-BR-4b-it
    parameters:
      density: 0.6
      weight: 0.34

  - model: soob3123/Veiled-Calla-4B
    parameters:
      density: 0.6
      weight: 0.33

  - model: soob3123/amoral-gemma3-4B-v2-qat
    parameters:
      density: 0.6
      weight: 0.33

merge_method: dare_ties
base_model: unsloth/gemma-3-4b-it-qat

parameters:
  normalize: true
  int8_mask: true

dtype: bfloat16

Technical Specifications

  • Architecture: Gemma-3 4B
  • Merge Method: DARE TIES
  • Precision: BFloat16
  • Quantization: QAT (Quantization Aware Training)
  • Normalization: Enabled
  • Int8 Mask: Enabled
  • Languages: Portuguese (PT-BR) and English

πŸ’» How to Use

Installing Dependencies

pip install -qU transformers accelerate torch

Basic Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

# Model configuration
model_name = "rodrigomt/gama-4b"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Example in Portuguese
messages_pt = [
    {"role": "user", "content": "What is a large language model?"}
]

# Example in English
messages_en = [
    {"role": "user", "content": "What is a large language model?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages_pt,
    tokenize=False,
    add_generation_prompt=True
)

# Pipeline configuration
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Text generation
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1
)

print(outputs[0]["generated_text"])
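
With a recent transformers release, the text-generation pipeline can also take the chat messages directly and apply the chat template for you; the exact return format may vary across versions, so treat this as a sketch.

# Pass the messages list directly; the pipeline applies the chat template internally
outputs = pipeline(messages_pt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"][-1]["content"])  # last turn is the model's reply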

Multilingual Usage Example

# Conversation switching languages
conversation = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hello! I'm doing well, thank you for asking. How can I help you today?"},
    {"role": "user", "content": "Can you switch to English?"},
    {"role": "assistant", "content": "Of course! I can communicate in both Portuguese and English. How can I help you?"}
]

prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=128, temperature=0.7)
print(outputs[0]["generated_text"])

Advanced Usage Example

# For more granular control over generation
def generate_response(prompt_text, max_tokens=256, temperature=0.7):
    # Tokenize (returns input_ids and attention_mask) and move to the model's device
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=temperature,
            top_k=50,
            top_p=0.95,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Using the function
response = generate_response("Explain machine learning in simple terms:")
print(response)
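
Note that generate_response expects raw text; for chat-style prompts you would normally apply the chat template first, as in the earlier examples.

# Wrap the question in the chat template before generating
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain machine learning in simple terms."}],
    tokenize=False,
    add_generation_prompt=True
)
print(generate_response(chat_prompt))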

⚠️ System Requirements

Minimum Configuration

  • RAM: 16GB
  • VRAM: 8GB (GPU)
  • Storage: 20GB available
  • GPU: GTX 3070 or higher

Recommended Configuration

  • RAM: 32GB
  • VRAM: 16GB (GPU)
  • GPU: RTX 4070, A4000 or higher
  • CPU: Modern multi-core processor

πŸ”§ Advanced Settings

Temperature Adjustment

# More creative responses (do_sample=True is required for temperature/top_p to take effect)
outputs = pipeline(prompt, do_sample=True, temperature=0.9, top_p=0.95)

# More conservative responses
outputs = pipeline(prompt, do_sample=True, temperature=0.3, top_k=30)

Repetition Control

# Reduce repetitions
outputs = pipeline(prompt, repetition_penalty=1.2, no_repeat_ngram_size=3)

πŸ“ License

This model is licensed under the Gemma License.
