---
title: AutoQuantNX
app_file: app.py
sdk: gradio
sdk_version: 4.44.1
---

# 🤗 [AutoQuantNX](https://huggingface.co/spaces/smokxy/AutoQuantNX)

## Overview
AutoQuantNX is a Gradio-based web application that simplifies optimizing and deploying Hugging Face models. It supports a wide range of tasks and handles quantization, ONNX conversion, and integration with the Hugging Face Hub: convert models to ONNX format, apply quantization techniques, and push the optimized models to your Hugging Face account, all through an intuitive user interface.

> **Note:** In the deployed UI, only 16-bit quantization works: BitsAndBytes requires a GPU, and the free Hugging Face Space provides none.

## Features

### Supported Tasks
AutoQuantNX supports the following tasks:

* Text Classification
* Named Entity Recognition (NER)
* Question Answering
* Causal Language Modeling
* Masked Language Modeling
* Sequence-to-Sequence Language Modeling
* Multiple Choice
* Whisper (Speech-to-Text)
* Embedding Fine-Tuning
* Image Classification (Placeholder for future implementation)

### Quantization Options
* None (default)
* 4-bit
* 8-bit
* 16-bit-float
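
As background for these options: 8-bit quantization maps floating-point weights onto integers through a scale and zero-point. The sketch below is a minimal pure-Python illustration of that affine scheme, not AutoQuantNX's actual code (the app relies on BitsAndBytes):

```python
def quantize_int8(weights):
    """Affine-quantize a list of floats to unsigned 8-bit integers.

    Returns (quantized values, scale, zero_point).
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # guard against a constant tensor
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map 8-bit values back to approximate floats."""
    return [(v - zero_point) * scale for v in q]

weights = [-1.2, -0.4, 0.0, 0.7, 1.5]
q, scale, zp = quantize_int8(weights)
recovered = dequantize_int8(q, scale, zp)
# Each recovered weight differs from the original by at most one scale step.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
```

Real per-tensor quantization works the same way, just over millions of weights at once, which is where the 4x size reduction over float32 comes from.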

### ONNX Conversion
Converts models to ONNX format for optimized deployment.

Supports optional ONNX quantization:
* 8-bit
* 16-bit-int
* 16-bit-float
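
The 16-bit-float option halves storage relative to float32 at the cost of precision (roughly three decimal digits). The effect can be demonstrated with the standard library's half-precision pack format; this is illustrative only, the app performs the conversion on model tensors:

```python
import struct

def to_fp16_bytes(values):
    """Pack floats as IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def from_fp16_bytes(data):
    """Unpack half-precision bytes back to Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))

weights = [0.1234567, -3.14159, 65504.0]  # 65504 is the largest normal fp16 value
packed = to_fp16_bytes(weights)
assert len(packed) == 2 * len(weights)  # half the footprint of float32
roundtripped = from_fp16_bytes(packed)
assert abs(roundtripped[0] - 0.1234567) < 1e-3  # small precision loss
```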

### Hugging Face Hub Integration
* Automatically pushes optimized models to your Hugging Face Hub repository
* Tags models with metadata for easy identification (e.g., onnx, quantized, task type)

### Performance Testing
Compares original and quantized models using metrics like:
* Mean Squared Error (MSE)
* Spearman Correlation
* Cosine Similarity
* Inference Time
* Model Size
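
The first three metrics are simple to state. As an illustration, here is a pure-Python sketch computing them over two hypothetical output vectors (the app computes them on real model outputs):

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def spearman(a, b):
    """Spearman rank correlation (no ties assumed, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

original = [0.9, 0.1, 0.5, 0.3]
quantized = [0.88, 0.12, 0.49, 0.31]  # hypothetical post-quantization outputs
assert mse(original, quantized) < 1e-3
assert cosine_similarity(original, quantized) > 0.999
assert spearman(original, quantized) == 1.0  # ranking preserved exactly
```

A quantized model that scores well on all three typically behaves interchangeably with the original for ranking-style tasks.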

## File Structure
```
AutoQuantNX/
├── src/
│   ├── handlers/
│   │   ├── audio_models/
│   │   │   └── whisper_handler.py
│   │   ├── img_models/
│   │   │   └── image_classification_handler.py
│   │   ├── nlp_models/
│   │   │   ├── causal_lm_handler.py
│   │   │   ├── embedding_model_handler.py
│   │   │   ├── masked_lm_handler.py
│   │   │   ├── multiple_choice_handler.py
│   │   │   ├── question_answering_handler.py
│   │   │   ├── seq2seq_lm_handler.py
│   │   │   ├── sequence_classification_handler.py
│   │   │   └── token_classification_handler.py
│   │   ├── __init__.py
│   │   └── base_handler.py
│   ├── optimizations/
│   │   ├── onnx_conversion.py
│   │   └── quantize.py
│   └── utilities/
│       ├── push_to_hub.py
│       └── resources.py
├── README.md
├── app.py
├── poetry.lock
├── pyproject.toml
└── requirements.txt
```

## Prerequisites

### Using requirements.txt (not preferable, to me at least)
* Python 3.8 or higher
* Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```

### Using Poetry
1. Install Poetry (if not already installed):
   
   Linux:
   ```bash
   curl -sSL https://install.python-poetry.org | python3 -
   ```
   Other platforms: Follow the official instructions.

2. Install dependencies:
   ```bash
   poetry install
   ```

3. Activate the virtual environment:
   ```bash
   poetry shell
   ```

## Usage

### Launch the App
Run the following command to start the Gradio web application:
```bash
python app.py
```
The app will be accessible at http://localhost:7860 by default.

### Steps to Use the App
1. Enter Model Details:
   * Provide the Hugging Face model name
   * Select the task type (e.g., text classification, question answering)

2. Select Optimization Options:
   * Choose quantization type (e.g., 4-bit, 8-bit)
   * Enable ONNX conversion and select quantization options if needed

3. Provide Hugging Face Token:
   * Enter your Hugging Face token for accessing and pushing models to the Hub

4. Start Conversion:
   * Click the "Start Conversion" button to process the model

5. Monitor Progress:
   * View real-time status updates, resource usage, and results directly in the app

6. Push to Hub:
   * Optimized models are automatically pushed to your specified Hugging Face repository

### Example
For a model like `bert-base-uncased` performing text classification:
1. Select `text_classification` as the task

2. Enable quantization (e.g., 8-bit)

3. Enable ONNX conversion with optimization

4. Click "Start Conversion" and monitor progress



## Key Functions



### app.py

* `process_model`: Main function handling model quantization, ONNX conversion, and Hugging Face Hub integration
* `update_memory_info`: Monitors and displays system resource usage

### optimizations/onnx_conversion.py

* `convert_to_onnx`: Converts models to ONNX format

* `quantize_onnx_model`: Quantizes ONNX models for optimized inference



### optimizations/quantize.py

* `ModelQuantizer`: Handles quantization of PyTorch models and performance testing



### utilities/push_to_hub.py

* `push_to_hub`: Pushes models to the Hugging Face Hub



### utilities/resources.py

* `ResourceManager`: Manages temporary files and memory usage
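
As an illustration of what such a manager does, here is a hypothetical, simplified sketch using only the standard library (not the actual class):

```python
import os
import shutil
import tempfile

class ResourceManager:
    """Minimal sketch: track temp directories and remove them on cleanup."""

    def __init__(self):
        self._temp_dirs = []

    def create_temp_dir(self):
        """Create and remember a scratch directory for conversion artifacts."""
        path = tempfile.mkdtemp(prefix="autoquantnx_")
        self._temp_dirs.append(path)
        return path

    def cleanup(self):
        """Delete every tracked directory, ignoring already-removed ones."""
        for path in self._temp_dirs:
            shutil.rmtree(path, ignore_errors=True)
        self._temp_dirs.clear()

rm = ResourceManager()
workdir = rm.create_temp_dir()
assert os.path.isdir(workdir)
rm.cleanup()
assert not os.path.exists(workdir)
```

Tracking scratch space this way matters here because ONNX export and quantization write multi-gigabyte intermediates that would otherwise accumulate between runs.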



## Notes

* Ensure you have sufficient system resources for model conversion and quantization

* Use a Hugging Face Hub token with proper write permissions for pushing models



## Troubleshooting

* Model Conversion Fails: Ensure the model and task are supported

* Insufficient Resources: Free up memory or reduce optimization levels

* ONNX Quantization Errors: Verify that the selected quantization type is supported for the model



## License

This project is licensed under the MIT License. See the LICENSE file for details.



## Contributions

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.



## Acknowledgments

* Hugging Face Transformers

* Optimum Library

* Gradio

* ONNX Runtime