---
license: apache-2.0
tags:
- vision
---


# VisualSplit

**VisualSplit** is a ViT-based model that explicitly factorises an image into **classical visual descriptors**—such as **edges**, **color segmentation**, and **grayscale histogram**—and learns to reconstruct the image conditioned on those descriptors. This design yields **interpretable representations** where geometry (edges), albedo/appearance (segmented colors), and global tone (histogram) can be reasoned about or varied independently.

> **Training data**: ImageNet-1K.  
---

## Model Description

- **Inputs** (at inference):
  - An RGB image, which the provided `FeatureExtractor` converts into the descriptors the model consumes (edges, color segmentation, grayscale histogram); see the illustrative sketch below.
- **Outputs**:
  - A reconstructed RGB image tensor (same spatial size as the model’s training resolution; default `224×224` unless you trained otherwise).
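
The sketch below is only meant to make the three descriptor families concrete; it is **not** the repo's `FeatureExtractor` (whose exact operators, tensor shapes, and normalisation may differ) and uses nothing beyond PyTorch, torchvision, and PIL:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load an image as a (1, 3, H, W) tensor in [0, 1]
img = transforms.ToTensor()(Image.open("input.jpg").convert("RGB")).unsqueeze(0)
gray = img.mean(dim=1, keepdim=True)          # (1, 1, H, W) grayscale

# Geometry: Sobel edge magnitude
kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
ky = kx.transpose(2, 3)                       # Sobel-y is the transpose of Sobel-x
gx = F.conv2d(gray, kx, padding=1)
gy = F.conv2d(gray, ky, padding=1)
edge_map = (gx ** 2 + gy ** 2).sqrt()         # (1, 1, H, W)

# Global tone: 256-bin grayscale histogram, normalised to sum to 1
hist = torch.histc(gray, bins=256, min=0.0, max=1.0)
hist = hist / hist.sum()                      # (256,)

# Appearance: crude colour "segmentation" by quantising each channel to 4 levels
segmented = (img * 3).round() / 3             # (1, 3, H, W)
```

At inference you do not compute these by hand; the `FeatureExtractor` shipped with the repo produces the tensors the model expects (see the example below).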

---

## Getting Started (Inference)

Below are two ways to run inference with the uploaded `model.safetensors`.

### 1) Minimal PyTorch + safetensors (load state dict)

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# 1) Import your model & config from the VisualSplit repo
from visualsplit.models.CrossViT import CrossViTForPreTraining, CrossViTConfig
from visualsplit.utils import FeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2) Build a config matching your training (edit if you changed widths/depths)
config = CrossViTConfig(
    image_size=224,           # change if your training size differs
    patch_size=16,
    # ... any other config fields your repo exposes
)

model = CrossViTForPreTraining(config).to(device)
model.eval()

# 3) Download and load the state dict from this model repo
#    Replace REPO_ID with your Hugging Face model id, e.g. "HenryQUQ/visualsplit"
ckpt_path = hf_hub_download(repo_id="REPO_ID", filename="model.safetensors")
state_dict = load_file(ckpt_path)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("Missing keys:", missing)
print("Unexpected keys:", unexpected)

# 4) Prepare an input image and extract descriptors
from PIL import Image
from torchvision import transforms

image = Image.open("input.jpg").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((config.image_size, config.image_size)),
    transforms.ToTensor(),
])
pixel_values = transform(image).unsqueeze(0).to(device)   # (1, 3, H, W)

# FeatureExtractor provided by the repo should return the required tensors
extractor = FeatureExtractor().to(device)
with torch.no_grad():
    edge, gray_hist, segmented_rgb, _ = extractor(pixel_values)

# 5) Run inference (reconstruction)
with torch.no_grad():
    outputs = model(
        source_edge=edge,
        source_gray_level_histogram=gray_hist,
        source_segmented_rgb=segmented_rgb,
    )
# Your repo’s forward returns may differ; adjust the key accordingly:
reconstructed = outputs["logits_reshape"]  # (1, 3, H, W)

# 6) Convert to PIL for visualisation
to_pil = transforms.ToPILImage()
recon_img = to_pil(reconstructed.squeeze(0).cpu().clamp(0, 1))
recon_img.save("reconstructed.png")
print("Saved to reconstructed.png")
```
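
Because the descriptors are decoupled, they can also be mixed across images, e.g. keeping the geometry and appearance of one image while taking the global tone from another. A minimal sketch, reusing the variables from the example above and assuming a second image `style.jpg` plus the same forward signature and output key:

```python
# Mix descriptors across images: edges + segmented colours from input.jpg,
# grayscale histogram from style.jpg (output key assumed, as above).
style = transform(Image.open("style.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    _, style_hist, _, _ = extractor(style)
    mixed = model(
        source_edge=edge,
        source_gray_level_histogram=style_hist,
        source_segmented_rgb=segmented_rgb,
    )
to_pil(mixed["logits_reshape"].squeeze(0).cpu().clamp(0, 1)).save("mixed_tone.png")
```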

### 2) Reproducing the notebook flow (`notebook/validation.ipynb`)

The repository provides a validation notebook that:
1. Loads the trained model,
2. Uses `FeatureExtractor` to compute **edges**, **color-segmented RGB**, and **grayscale histograms**,
3. Runs the model to obtain a reconstructed image,
4. Saves/visualises the result (a quick side-by-side sketch is shown below).
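
For a quick visual check outside the notebook, the fragment below (a sketch reusing `image`, `recon_img`, and `config` from the snippet above, not the notebook's own code) writes the original and the reconstruction side by side:

```python
# Paste the resized original and the reconstruction onto one canvas.
from PIL import Image

canvas = Image.new("RGB", (2 * config.image_size, config.image_size))
canvas.paste(image.resize((config.image_size, config.image_size)), (0, 0))
canvas.paste(recon_img, (config.image_size, 0))
canvas.save("comparison.png")
```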

---

## Installation & Requirements

```bash
# clone the VisualSplit code
git clone https://github.com/HenryQUQ/VisualSplit.git
cd VisualSplit

# dependencies used by the inference example above
pip install torch torchvision safetensors huggingface_hub pillow

# optionally install the repo itself in editable mode (if it ships a setup):
# pip install -e .
```

---

## Training Data

- **Dataset**: **ImageNet-1K**.

> This repository only hosts the **trained checkpoint for inference**. See the GitHub repo for the full training pipeline and data-preparation scripts.

---

## Model Sources

- **Code**: https://github.com/HenryQUQ/VisualSplit  
- **Weights (this page)**: this Hugging Face model repo

---

## Citation

If you use this model or the ideas behind it, please cite:

```bibtex
@inproceedings{Qu2025VisualSplit,
  title     = {Exploring Image Representation with Decoupled Classical Visual Descriptors},
  author    = {Qu, Chenyuan and Chen, Hao and Jiao, Jianbo},
  booktitle = {British Machine Vision Conference (BMVC)},
  year      = {2025}
}
```

---