---
license: apache-2.0
tags:
- vision
---


# VisualSplit

**VisualSplit** is a ViT-based model that explicitly factorises an image into **classical visual descriptors**—such as **edges**, **color segmentation**, and **grayscale histogram**—and learns to reconstruct the image conditioned on those descriptors. This design yields **interpretable representations** where geometry (edges), albedo/appearance (segmented colors), and global tone (histogram) can be reasoned about or varied independently.

> **Training data**: ImageNet-1K.  
---

## Model Description

- **Inputs** (at inference):
  - An RGB image, which the provided `FeatureExtractor` converts into the descriptors the model consumes (edges, color segmentation, grayscale histogram); see the illustrative sketch below.
- **Outputs**:
  - A reconstructed RGB image tensor (same spatial size as the model’s training resolution; default `224×224` unless you trained otherwise).
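
The sketch below is only meant to make the three descriptor families concrete; it is **not** the repo's `FeatureExtractor` (whose exact operators, tensor shapes, and normalisation may differ) and uses nothing beyond PyTorch, torchvision, and PIL:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load an image as a (1, 3, H, W) tensor in [0, 1]
img = transforms.ToTensor()(Image.open("input.jpg").convert("RGB")).unsqueeze(0)
gray = img.mean(dim=1, keepdim=True)          # (1, 1, H, W) grayscale

# Geometry: Sobel edge magnitude
kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
ky = kx.transpose(2, 3)                       # Sobel-y is the transpose of Sobel-x
gx = F.conv2d(gray, kx, padding=1)
gy = F.conv2d(gray, ky, padding=1)
edge_map = (gx ** 2 + gy ** 2).sqrt()         # (1, 1, H, W)

# Global tone: 256-bin grayscale histogram, normalised to sum to 1
hist = torch.histc(gray, bins=256, min=0.0, max=1.0)
hist = hist / hist.sum()                      # (256,)

# Appearance: crude colour "segmentation" by quantising each channel to 4 levels
segmented = (img * 3).round() / 3             # (1, 3, H, W)
```

At inference you do not compute these by hand; the `FeatureExtractor` shipped with the repo produces the tensors the model expects (see the example below).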

---

## Getting Started (Inference)

Below are two ways to run inference with the uploaded `model.safetensors`.

### 1) Minimal PyTorch + safetensors (load state dict)

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# 1) Import your model & config from the VisualSplit repo
from visualsplit.models.CrossViT import CrossViTForPreTraining, CrossViTConfig
from visualsplit.utils import FeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2) Build a config matching your training (edit if you changed widths/depths)
config = CrossViTConfig(
    image_size=224,           # change if your training size differs
    patch_size=16,
    # ... any other config fields your repo exposes
)

model = CrossViTForPreTraining(config).to(device)
model.eval()

# 3) Download and load the state dict from this model repo
#    Replace REPO_ID with your Hugging Face model id, e.g. "HenryQUQ/visualsplit"
ckpt_path = hf_hub_download(repo_id="REPO_ID", filename="model.safetensors")
state_dict = load_file(ckpt_path)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("Missing keys:", missing)
print("Unexpected keys:", unexpected)

# 4) Prepare an input image and extract descriptors
from PIL import Image
from torchvision import transforms

image = Image.open("input.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((config.image_size, config.image_size)),
    transforms.ToTensor(),
])
pixel_values = transform(image).unsqueeze(0).to(device)   # (1, 3, H, W)

# FeatureExtractor provided by the repo should return the required tensors
extractor = FeatureExtractor().to(device)
with torch.no_grad():
    edge, gray_hist, segmented_rgb, _ = extractor(pixel_values)

# 5) Run inference (reconstruction)
with torch.no_grad():
    outputs = model(
        source_edge=edge,
        source_gray_level_histogram=gray_hist,
        source_segmented_rgb=segmented_rgb,
    )
# Your repo’s forward returns may differ; adjust the key accordingly:
reconstructed = outputs["logits_reshape"]  # (1, 3, H, W)

# 6) Convert to PIL for visualisation
to_pil = transforms.ToPILImage()
recon_img = to_pil(reconstructed.squeeze(0).cpu().clamp(0, 1))
recon_img.save("reconstructed.png")
print("Saved to reconstructed.png")
```
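
Because the descriptors are decoupled, they can also be mixed across images, e.g. keeping the geometry and appearance of one image while taking the global tone from another. A minimal sketch, reusing the variables from the example above and assuming a second image `style.jpg` plus the same forward signature and output key:

```python
# Mix descriptors across images: edges + segmented colours from input.jpg,
# grayscale histogram from style.jpg (output key assumed, as above).
style = transform(Image.open("style.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    _, style_hist, _, _ = extractor(style)
    mixed = model(
        source_edge=edge,
        source_gray_level_histogram=style_hist,
        source_segmented_rgb=segmented_rgb,
    )
to_pil(mixed["logits_reshape"].squeeze(0).cpu().clamp(0, 1)).save("mixed_tone.png")
```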

### 2) Reproducing the notebook flow (`notebook/validation.ipynb`)

The repository provides a validation notebook that:
1. Loads the trained model,
2. Uses `FeatureExtractor` to compute **edges**, **color-segmented RGB**, and **grayscale histograms**,
3. Runs the model to obtain a reconstructed image,
4. Saves/visualises the result (a quick side-by-side sketch is shown below).
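
For a quick visual check outside the notebook, the fragment below (a sketch reusing `image`, `recon_img`, and `config` from the snippet above, not the notebook's own code) writes the original and the reconstruction side by side:

```python
# Paste the resized original and the reconstruction onto one canvas.
from PIL import Image

canvas = Image.new("RGB", (2 * config.image_size, config.image_size))
canvas.paste(image.resize((config.image_size, config.image_size)), (0, 0))
canvas.paste(recon_img, (config.image_size, 0))
canvas.save("comparison.png")
```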

---

## Installation & Requirements

```bash
# clone the VisualSplit code
git clone https://github.com/HenryQUQ/VisualSplit.git
cd VisualSplit

# dependencies used by the inference example above
pip install torch torchvision safetensors huggingface_hub pillow

# optionally install the repo itself in editable mode (if it ships a setup):
# pip install -e .
```

---

## Training Data

- **Dataset**: **ImageNet-1K**.

> This repository only hosts the **trained checkpoint for inference**. See the GitHub repo for the full training pipeline and data-preparation scripts.

---

## Model Sources

- **Code**: https://github.com/HenryQUQ/VisualSplit  
- **Weights (this page)**: this Hugging Face model repo

---

## Citation

If you use this model or the ideas behind it, please cite:

```bibtex
@inproceedings{Qu2025VisualSplit,
  title     = {Exploring Image Representation with Decoupled Classical Visual Descriptors},
  author    = {Qu, Chenyuan and Chen, Hao and Jiao, Jianbo},
  booktitle = {British Machine Vision Conference (BMVC)},
  year      = {2025}
}
```

---