guava-05-18 (v8 LoRA adapter)

LoRA adapter for Qwen/Qwen3.5-4B, trained to act as a closed-loop tool-calling controller for a 7-DoF robot arm. Part of the guava project.

This is the v8 / 2026-05-18 checkpoint (checkpoint-342, best val loss 0.545, selected via load_best_model_at_end=True).

⚠ Loading: use the multimodal auto-class

Qwen/Qwen3.5-4B is a vision-language model. Its decoder lives at model.language_model.*, and the LoRA target regex in this adapter points there. You must load it with AutoModelForImageTextToText (or Qwen3_5ForConditionalGeneration directly), NOT AutoModelForCausalLM.

Using AutoModelForCausalLM will:

  • raise ValueError: Target modules ... not found in the base model when PeftModel.from_pretrained(...) tries to attach the adapter, OR
  • (if you force-load via swap-in) blow up later with AttributeError: ... 'get_rope_idx' — the multimodal RoPE-index method only exists on the conditional-generation class.
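The first failure can be reproduced without loading any weights. When target_modules is a single string, PEFT compares it against each module name with re.fullmatch, so a pattern anchored at model.language_model matches nothing in a model whose decoder sits elsewhere. The sketch below uses a hypothetical regex and hypothetical module names; the real pattern is in the adapter's adapter_config.json, and the causal-LM layout shown is an assumption.

```python
import re

# Hypothetical pattern, for illustration only; the real one is stored in the
# adapter's adapter_config.json under "target_modules".
TARGET_REGEX = r"model\.language_model\.layers\.\d+\.(self_attn|mlp)\.\w+_proj"

# Hypothetical module names for the two ways of loading the base model.
conditional_generation_names = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]
causal_lm_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
]


def matched_modules(names, pattern=TARGET_REGEX):
    """Return the names accepted by a whole-string regex match."""
    return [name for name in names if re.fullmatch(pattern, name)]


print(matched_modules(conditional_generation_names))  # the two decoder projections
print(matched_modules(causal_lm_names))               # empty: nothing to attach to
```

An empty match list corresponds to the "Target modules ... not found" error above.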

Training data

data/version-8/training.jsonl — 14 robosuite tasks collected with SAM3 perception + a GraspGen 6-DoF planner. Trial format matches the inference loop: per-turn <image> + gripper-state observation, <think> reasoning, exactly one <tool_call>{name, arguments}, and a tool-response turn.
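The turn structure described above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the project's harness: policy and execute_tool are hypothetical callables, the "tool" role name is assumed, and placing the next observation in a fresh user turn after the tool response is a guess at the layout. The real schema is defined by the training data and the eval harness.

```python
TERMINAL_LINES = ("Task complete.", "Task failed.")


def run_episode(policy, execute_tool, system_prompt, first_observation, max_turns=20):
    """Alternate model turns and tool responses until the model terminates.

    policy(messages) -> assistant text (hypothetical callable).
    execute_tool(assistant_text) -> (tool_response, next_observation) (hypothetical).
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": first_observation},
    ]
    for _ in range(max_turns):
        reply = policy(messages)
        messages.append({"role": "assistant", "content": reply})
        # The model ends the episode with one of the two literal lines.
        if reply.rstrip().endswith(TERMINAL_LINES):
            break
        tool_response, next_observation = execute_tool(reply)
        # Assumed layout: tool result first, then the new image + gripper state.
        messages.append({"role": "tool", "content": tool_response})
        messages.append({"role": "user", "content": next_observation})
    return messages
```

The max_turns cap is also an assumption; it keeps a run bounded if the model never emits a terminal line.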

Task families:

  • collect/ (11): can_in_bin, close_drawer, cube_stack, cube_under_cup, hotdog_near_donut, milk_near_cup, open_drawer, pick_up_orange, push_basket, push_cereal, tomato_in_bowl
  • new_counterfactuals/ (3): apple_juice_order, bread_near_lemon, remove_chocolate_from_cube (main + branch trajectories)

Training hyperparameters

Setting | Value
--- | ---
Base model | Qwen/Qwen3.5-4B
Dtype | bfloat16
Tuner | LoRA
LoRA rank / alpha / dropout | 16 / 32 / 0.05
Target modules | all-linear (under model.language_model)
Frozen | ViT, aligner (LM + LoRA trained)
Epochs | 3
LR / schedule | 2e-5 / cosine, 5% warmup
Weight decay / max grad norm | 0.01 / 1.0
Per-device batch / grad accum | 1 / 8 (effective 8 per device)
Max length | 8192
Best metric | val_loss = 0.545 @ step 342
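A few of these settings imply derived quantities, worked out in the sketch below. It assumes standard LoRA scaling (alpha / rank), a linear warmup, and a cosine decay to zero; the actual scheduler details are not stated here, and total_steps is a hypothetical argument because the total step count is not listed.

```python
import math

RANK, ALPHA = 16, 32
PER_DEVICE_BATCH, GRAD_ACCUM = 1, 8
PEAK_LR, WARMUP_FRACTION = 2e-5, 0.05

# Standard LoRA multiplies the low-rank update by alpha / rank (assumed here).
lora_scale = ALPHA / RANK

# Samples contributing to one optimizer step on a single device.
effective_batch_per_device = PER_DEVICE_BATCH * GRAD_ACCUM


def learning_rate(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(lora_scale, effective_batch_per_device)
print(learning_rate(342, total_steps=400))  # 400 is a made-up total for illustration
```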

System prompt

This adapter was trained against a specific compressed system prompt — see system_prompt.txt (~1.6K chars). Any inference must use this exact prompt; substituting a different one produces a noticeable distribution shift.

Usage

Transformers + PEFT (verified working)

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16, device_map="cuda",
)
model = PeftModel.from_pretrained(base, "AIcell/guava-05-18")
proc = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")

system_prompt = open("system_prompt.txt").read().strip()  # or pull from this repo
scene_img = Image.open("scene.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": scene_img},
        {"type": "text", "text":
            "Task: place the cup on the plate.\n\n"
            "Gripper is at [0.45, 0.0, 0.25] rotation [-176°, 0°, 90°] width 52%."},
    ]},
]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Expected output is a <think>...</think> block followed by exactly one <tool_call>{...}</tool_call> (or Task complete. / Task failed. to terminate).
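The user-turn text in the example can be produced by a small formatter so that every turn uses the same layout. This is a sketch inferred from the single example above: format_observation is a hypothetical helper, and the rounding of rotation and width to whole numbers is an assumption about the training format.

```python
def format_observation(task, position, rotation_deg, width_pct):
    """Build the user-turn text: task line, blank line, gripper-state line."""
    pos = ", ".join(str(float(v)) for v in position)
    # Assumption: rotations and width are reported as whole numbers.
    rot = ", ".join(f"{int(round(v))}°" for v in rotation_deg)
    width = int(round(width_pct))
    return (
        f"Task: {task}\n\n"
        f"Gripper is at [{pos}] rotation [{rot}] width {width}%."
    )


text = format_observation(
    "place the cup on the plate.", [0.45, 0.0, 0.25], [-176, 0, 90], 52
)
print(text)
```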

vLLM with LoRA serving

vllm serve Qwen/Qwen3.5-4B \
    --port 8000 --max-model-len 24576 \
    --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice --enable-lora \
    --lora-modules guava-05-18=AIcell/guava-05-18 \
    --limit-mm-per-prompt '{"image": 20}'

Then send chat completions with model=guava-05-18 and the same message format as the transformers example (system prompt from system_prompt.txt, user turn with image + task description + gripper state).
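A request against the server started above might look like the sketch below. It assumes the OpenAI-compatible /v1/chat/completions route on localhost:8000 and passes the image as a base64 data URL. build_payload and send are hypothetical helpers, and the max_tokens and temperature values simply mirror the transformers example. With the reasoning and tool-call parsers enabled, the server may return the reasoning and the tool call in separate response fields rather than as inline tags, so inspect the raw response first.

```python
import base64
import json
import urllib.request


def build_payload(system_prompt, user_text, png_bytes, model="guava-05-18"):
    """Assemble a chat-completions request body with one image and one text part."""
    image_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,  # the LoRA name registered via --lora-modules
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": user_text},
            ]},
        ],
        "max_tokens": 512,
        "temperature": 0.0,
    }


def send(payload, base_url="http://localhost:8000"):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling send(build_payload(system_prompt, text, open("scene.png", "rb").read())) would issue one turn; the file name is the same placeholder used in the transformers example.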

Per-turn message format

Each assistant turn is a <think>…</think> block followed by exactly one of:

  • <tool_call>{"name": "<tool>", "arguments": {…}}</tool_call> (the control loop continues);
  • the literal line Task complete. or Task failed. (the agent terminates).

The system prompt enumerates the 9 available tools (grasp, align, move, rotate, close_gripper, release, home_pose, get_position, get_position_and_size).
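A runner can enforce this format before executing anything. The sketch below strips the reasoning block, then accepts either a terminal line or exactly one tool call whose name is among the nine listed tools. parse_turn is a hypothetical helper, and the argument names in the usage example are made up; the real argument schema comes from the system prompt.

```python
import json
import re

TOOLS = {
    "grasp", "align", "move", "rotate", "close_gripper", "release",
    "home_pose", "get_position", "get_position_and_size",
}
TERMINALS = {"Task complete.", "Task failed."}


def parse_turn(text):
    """Return ("tool", name, arguments) or ("done", status_line, None)."""
    # Drop the reasoning block; only what follows it is acted on.
    body = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    if body in TERMINALS:
        return ("done", body, None)
    calls = re.findall(r"<tool_call>(.*?)</tool_call>", body, flags=re.DOTALL)
    if len(calls) != 1:
        raise ValueError(f"expected exactly one tool call, found {len(calls)}")
    call = json.loads(calls[0])
    if call.get("name") not in TOOLS:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    return ("tool", call["name"], call.get("arguments", {}))


# Hypothetical arguments, for illustration only.
print(parse_turn(
    '<think>The cup is visible.</think>\n'
    '<tool_call>{"name": "grasp", "arguments": {"target": "cup"}}</tool_call>'
))
```

Raising on malformed turns, rather than guessing, keeps a bad generation from being sent to the robot.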

Evaluation

Tested across three eval domains (5 episodes per task, 43 tasks total):

Tier | Tasks | Success | Rate
--- | --- | --- | ---
ood-near (held-out robosuite) | 9 | 18/45 | 40.0%
simpler (SimplerEnv) | 25 | 17/125 | 13.6%
libero (LIBERO-PRO) | 9 | 2/45 | 4.4%
OVERALL | 43 | 37/215 | 17.2%
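The overall figure pools episodes across tiers, so the larger SimplerEnv tier weighs more than the other two. The short check below recomputes the rates from the counts in the table and contrasts the pooled rate with an unweighted mean of the tier rates; the variable names are illustrative.

```python
# (successes, episodes) per tier, copied from the table above.
results = {
    "ood-near": (18, 45),
    "simpler": (17, 125),
    "libero": (2, 45),
}


def rate(successes, episodes):
    """Success rate as a percentage."""
    return 100.0 * successes / episodes


tier_rates = {tier: rate(s, n) for tier, (s, n) in results.items()}
total_successes = sum(s for s, _ in results.values())
total_episodes = sum(n for _, n in results.values())

pooled_rate = rate(total_successes, total_episodes)        # episode-weighted
macro_rate = sum(tier_rates.values()) / len(tier_rates)    # each tier counts equally

print(f"pooled {pooled_rate:.1f}%  macro {macro_rate:.1f}%")
```

The pooled value matches the OVERALL row; the unweighted mean is about two points higher because it gives the small ood-near tier the same weight as SimplerEnv.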

Dominant failure mode (≈ 50% of zero-success tasks): model_declared_done — agent terminates without verifying the actual outcome. Future training data should include verify-after-action patterns to address this.

Limitations

  • LIBERO and SimplerEnv are out-of-distribution domains (different cameras, robots, action spaces). LIBERO success is near-zero on most tasks despite some pick-and-place generalization.
  • Drawer / articulated-object tasks largely fail; training data has no prolonged-contact / sustained-pull patterns.
  • WidowX (SimplerEnv) is poorly supported: the 5-DoF arm cannot satisfy the PCA top-down grasp pose that the controller assumes.

Citation / source

Adapter weights, training script, and eval harness: https://github.com/hdacnw/guava
