guava-05-18 (v8 LoRA adapter)
LoRA adapter for Qwen/Qwen3.5-4B, trained to act as a closed-loop
tool-calling controller for a 7-DoF robot arm. Part of the
guava project.
This is the v8 / 2026-05-18 checkpoint (checkpoint-342, best val
loss 0.545, selected via load_best_model_at_end=True).
⚠ Loading: use the multimodal auto-class
Qwen/Qwen3.5-4B is a vision-language model. Its decoder lives at
model.language_model.*, and the LoRA target regex in this adapter
points there. You must load it with AutoModelForImageTextToText
(or Qwen3_5ForConditionalGeneration directly), NOT
AutoModelForCausalLM.
Using AutoModelForCausalLM will either:
- raise ValueError: Target modules ... not found in the base model when PeftModel.from_pretrained(...) tries to attach the adapter, or
- (if you force-load via swap-in) blow up later with AttributeError: ... 'get_rope_idx', because the multimodal RoPE-index method only exists on the conditional-generation class.
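The first failure comes down to module names. When an adapter's target_modules is a regex string, PEFT matches it against each module name with re.fullmatch, so a pattern anchored at model.language_model finds nothing in a model whose decoder sits under a different prefix. The sketch below illustrates that check only; the pattern and the module names are hypothetical stand-ins, not the values stored in this adapter's config.

```python
import re

def matching_modules(target_pattern, module_names):
    """Return the module names that a regex-style LoRA target fully matches."""
    return [name for name in module_names if re.fullmatch(target_pattern, name)]

# Hypothetical pattern and names, for illustration only.
pattern = r"model\.language_model\.layers\.\d+\.self_attn\.(q_proj|v_proj)"

multimodal_names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.v_proj",
    "model.visual.blocks.0.attn.qkv",
]
causal_lm_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
]

print(matching_modules(pattern, multimodal_names))  # the two decoder projections
print(matching_modules(pattern, causal_lm_names))   # empty: nothing to attach to
```

With a real model, the names would come from [name for name, _ in base.named_modules()], which makes this a cheap check to run before attaching the adapter.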
Training data
data/version-8/training.jsonl — 14 robosuite tasks collected with
SAM3 perception + a GraspGen 6-DoF planner. Trial format matches the
inference loop: per-turn <image> + gripper-state observation,
<think> reasoning, exactly one <tool_call>{name, arguments}, and a
tool-response turn.
Task families:
- collect/ (11): can_in_bin, close_drawer, cube_stack, cube_under_cup, hotdog_near_donut, milk_near_cup, open_drawer, pick_up_orange, push_basket, push_cereal, tomato_in_bowl
- new_counterfactuals/ (3): apple_juice_order, bread_near_lemon, remove_chocolate_from_cube (main + branch trajectories)
Training hyperparameters
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Dtype | bfloat16 |
| Tuner | LoRA |
| LoRA rank / alpha / dropout | 16 / 32 / 0.05 |
| Target modules | all-linear (under model.language_model) |
| Frozen | ViT, aligner (LM + LoRA trained) |
| Epochs | 3 |
| LR / schedule | 2e-5 / cosine, 5% warmup |
| Weight decay / max grad norm | 0.01 / 1.0 |
| Per-device batch / grad accum | 1 / 8 (effective 8 per device) |
| Max length | 8192 |
| Best metric | val_loss = 0.545 @ step 342 |
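To make the LR / schedule row concrete, here is a minimal sketch of the usual form of a cosine schedule with linear warmup: the rate climbs linearly to the peak over the first 5% of steps, then follows half a cosine down to zero. The total step count is a hypothetical value (the card does not state it), and the actual training script may differ in details such as rounding of the warmup length.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical; not taken from the card

# Start of warmup, end of warmup (peak), midpoint of decay, final step.
for s in (0, 50, 525, 1000):
    print(s, f"{lr_at(s, TOTAL_STEPS):.2e}")
```

The rate peaks at the end of warmup, is half the peak at the midpoint of the decay phase, and reaches zero on the last step.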
System prompt
This adapter was trained against a specific compressed system
prompt — see system_prompt.txt (~1.6K
chars). Any inference must use this exact prompt; substituting a
different one produces a noticeable distribution shift.
Usage
Transformers + PEFT (verified working)
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16, device_map="cuda",
)
model = PeftModel.from_pretrained(base, "AIcell/guava-05-18")
proc = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")

system_prompt = open("system_prompt.txt").read().strip()  # or pull from this repo
scene_img = Image.open("scene.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": scene_img},
        {"type": "text", "text":
            "Task: place the cup on the plate.\n\n"
            "Gripper is at [0.45, 0.0, 0.25] rotation [-176°, 0°, 90°] width 52%."},
    ]},
]

inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
Expected output is a <think>...</think> block followed by exactly
one <tool_call>{...}</tool_call> (or Task complete. / Task failed. to terminate).
vLLM with LoRA serving
vllm serve Qwen/Qwen3.5-4B \
    --port 8000 --max-model-len 24576 \
    --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice --enable-lora \
    --lora-modules guava-05-18=AIcell/guava-05-18 \
    --limit-mm-per-prompt '{"image": 20}'
Then send chat completions with model=guava-05-18 and the same
message format as the transformers example (system prompt from
system_prompt.txt, user turn with image + task description + gripper
state).
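As a rough sketch of such a request, the snippet below builds an OpenAI-style chat-completions payload with the scene image inlined as a base64 data URL. The helper name build_chat_request and the placeholder image bytes are hypothetical. max_tokens and temperature mirror the greedy 512-token settings of the transformers example rather than any documented serving recommendation.

```python
import base64
import json

def build_chat_request(system_prompt, image_bytes, user_text, model="guava-05-18"):
    """Build an OpenAI-style chat-completions payload with one inline PNG image."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": user_text},
            ]},
        ],
        "max_tokens": 512,
        "temperature": 0.0,
    }

payload = build_chat_request(
    system_prompt="<contents of system_prompt.txt>",
    image_bytes=b"\x89PNG placeholder bytes",  # read scene.png in real use
    user_text="Task: place the cup on the plate.\n\n"
              "Gripper is at [0.45, 0.0, 0.25] rotation [-176°, 0°, 90°] width 52%.",
)
body = json.dumps(payload).encode("utf-8")
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (port 8000 as in the serve command).
```

Building the payload separately from sending it keeps the message format easy to inspect and to reuse across turns.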
Per-turn message format
Each assistant turn is a <think>…</think> block followed by exactly
one of:
- <tool_call>{"name": "<tool>", "arguments": {…}}</tool_call>, when control is to continue;
- the literal line Task complete. or Task failed., when the agent decides to terminate.
The system prompt enumerates the 9 available tools (grasp, align, move, rotate, close_gripper, release, home_pose, get_position, get_position_and_size).
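The per-turn format can be checked mechanically before anything is sent to the robot. Below is a minimal sketch of a parser for raw decoded text, assuming the tags appear literally in the output. The tool names come from the list above, while the argument shown in the usage lines ("object": "cup") is hypothetical, since the tool schemas live in the system prompt.

```python
import json
import re

TOOLS = {"grasp", "align", "move", "rotate", "close_gripper", "release",
         "home_pose", "get_position", "get_position_and_size"}
TERMINALS = {"Task complete.", "Task failed."}

def parse_turn(text):
    """Split one assistant turn into (kind, payload).

    kind is "tool_call" (payload: dict with name/arguments),
    "terminal" (payload: the literal line), or "invalid" (payload: reason).
    """
    # Drop the reasoning block; only what follows it is acted on.
    rest = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    if rest in TERMINALS:
        return "terminal", rest
    calls = re.findall(r"<tool_call>(.*?)</tool_call>", rest, flags=re.DOTALL)
    if len(calls) != 1:
        return "invalid", f"expected exactly one tool call, found {len(calls)}"
    try:
        call = json.loads(calls[0])
    except json.JSONDecodeError as exc:
        return "invalid", f"malformed JSON: {exc.msg}"
    if not isinstance(call, dict) or call.get("name") not in TOOLS:
        return "invalid", "unknown or missing tool name"
    return "tool_call", {"name": call["name"], "arguments": call.get("arguments", {})}

print(parse_turn('<think>Cup is visible.</think>\n'
                 '<tool_call>{"name": "grasp", "arguments": {"object": "cup"}}</tool_call>'))
print(parse_turn("<think>Cup is on the plate.</think>\nTask complete."))
```

Returning an explicit "invalid" kind lets the caller decide whether to retry, abort, or log the turn. When serving through vLLM with the reasoning and tool-call parsers enabled, the server may already return these parts as separate fields, in which case this step is unnecessary.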
Evaluation
Tested across three eval domains (5 episodes per task, 43 tasks total):
| Tier | Tasks | Success | Rate |
|---|---|---|---|
| ood-near (held-out robosuite) | 9 | 18/45 | 40.0% |
| simpler (SimplerEnv) | 25 | 17/125 | 13.6% |
| libero (LIBERO-PRO) | 9 | 2/45 | 4.4% |
| OVERALL | 43 | 37/215 | 17.2% |
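The rates in the table can be recomputed from the raw counts. The short check below uses only the numbers from the table and confirms that each tier has 5 episodes per task and that the overall figure is the episode-weighted total.

```python
results = {  # tier: (tasks, successes, episodes), copied from the table
    "ood-near": (9, 18, 45),
    "simpler": (25, 17, 125),
    "libero": (9, 2, 45),
}
EPISODES_PER_TASK = 5

for tier, (tasks, wins, episodes) in results.items():
    assert episodes == tasks * EPISODES_PER_TASK
    print(f"{tier:10s} {wins}/{episodes} = {100 * wins / episodes:.1f}%")

total_tasks = sum(t for t, _, _ in results.values())
total_wins = sum(w for _, w, _ in results.values())
total_eps = sum(e for _, _, e in results.values())
print(f"overall    {total_wins}/{total_eps} = {100 * total_wins / total_eps:.1f}%")
```

Because the overall rate is weighted by episode count, the SimplerEnv tier (125 of 215 episodes) contributes most of the weight.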
Dominant failure mode (≈ 50% of zero-success tasks):
model_declared_done — agent terminates without verifying the actual
outcome. Future training data should include verify-after-action
patterns to address this.
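Until such data exists, a harness can at least detect this failure at evaluation time. The sketch below is one possible harness-side guard, not the project's eval code: the three callables and the max_turns cap are hypothetical hooks, and the status label simply reuses the failure-mode name above.

```python
def run_episode(policy_step, execute_tool, verify_outcome, max_turns=20):
    """Drive a closed loop and re-check the scene before accepting 'Task complete.'.

    All three callables are hypothetical hooks supplied by the harness:
      policy_step(observation) -> ("tool_call", {...}) or ("terminal", "<line>")
      execute_tool(call)       -> next observation
      verify_outcome()         -> True if an independent check confirms success
    """
    observation = None
    for turn in range(max_turns):
        kind, payload = policy_step(observation)
        if kind == "tool_call":
            observation = execute_tool(payload)
            continue
        # Terminal turn: trust "Task failed.", but verify a success claim.
        if payload == "Task complete." and not verify_outcome():
            return {"status": "model_declared_done", "turns": turn + 1}
        status = "success" if payload == "Task complete." else "failed"
        return {"status": status, "turns": turn + 1}
    return {"status": "max_turns", "turns": max_turns}

# Scripted policy: one tool call, then an unverified success claim.
script = iter([("tool_call", {"name": "grasp", "arguments": {}}),
               ("terminal", "Task complete.")])
result = run_episode(lambda obs: next(script), lambda call: "obs", lambda: False)
print(result)
```

This only labels premature termination after the fact; it does not change what the model does, which is why the training-data fix remains the real remedy.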
Limitations
- LIBERO and SimplerEnv are out-of-distribution domains (different cameras, robots, action spaces). LIBERO success is near-zero on most tasks despite some pick-and-place generalization.
- Drawer / articulated-object tasks largely fail; training data has no prolonged-contact / sustained-pull patterns.
- WidowX (SimplerEnv) is poorly supported: the 5-DoF arm cannot satisfy the PCA top-down grasp pose that the controller assumes.
Citation / source
Adapter weights, training script, and eval harness: https://github.com/hdacnw/guava