smokxy committed on
Commit
9bf1d31
·
verified ·
1 Parent(s): fe21c85

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -1,35 +1,2 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
1
+ # Auto detect text files and perform LF normalization
2
+ * text=auto
 
README.md CHANGED
@@ -1,12 +1,190 @@
1
- ---
2
- title: AutoQuantNX
3
- emoji: 🐒
4
- colorFrom: purple
5
- colorTo: green
6
- sdk: gradio
7
- sdk_version: 5.15.0
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
+ ---
2
+ title: AutoQuantNX
3
+ app_file: app.py
4
+ sdk: gradio
5
+ sdk_version: 4.44.1
6
+ ---
7
+ # 🤗 AutoQuantNX (**still under active testing and improvement**)
8
+
9
+ ## Overview
10
+ AutoQuantNX is a Gradio-based web application that simplifies optimizing and deploying Hugging Face models. It covers a wide range of tasks and handles quantization, ONNX conversion, and integration with the Hugging Face Hub. With AutoQuantNX you can convert models to ONNX, apply quantization, and push the optimized models to your Hugging Face account, all through an intuitive user interface.
11
+
12
+ ## Features
13
+
14
+ ### Supported Tasks
15
+ AutoQuantNX supports the following tasks:
16
+
17
+ * Text Classification
18
+ * Named Entity Recognition (NER)
19
+ * Question Answering
20
+ * Causal Language Modeling
21
+ * Masked Language Modeling
22
+ * Sequence-to-Sequence Language Modeling
23
+ * Multiple Choice
24
+ * Whisper Speech-to-Text (placeholder for future implementation)
25
+ * Embedding Fine-Tuning
26
+ * Image Classification (Placeholder for future implementation)
27
+
28
+ ### Quantization Options
29
+ * None (default)
30
+ * 4-bit
31
+ * 8-bit
32
+ * 16-bit-float (see the loading sketch after this list)
33
+
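+ For reference, this is a minimal sketch of how the three non-default options are applied internally (see `src/optimizations/quantize.py` in this commit); the model name is only an illustration:
+ ```python
+ from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
+ import torch
+
+ name = "bert-base-uncased"  # illustrative model
+ model_4bit = AutoModelForSequenceClassification.from_pretrained(
+     name, quantization_config=BitsAndBytesConfig(load_in_4bit=True)
+ )
+ model_8bit = AutoModelForSequenceClassification.from_pretrained(
+     name, quantization_config=BitsAndBytesConfig(load_in_8bit=True)
+ )
+ model_fp16 = AutoModelForSequenceClassification.from_pretrained(name).to(torch.float16)
+ ```
+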
34
+ ### ONNX Conversion
35
+ Converts models to ONNX format for optimized deployment.
36
+
37
+ Supports optional ONNX quantization (a usage sketch follows the list):
38
+ * 8-bit
39
+ * 16-bit-int
40
+ * 16-bit-float
41
+
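+ A minimal usage sketch of the underlying helpers (both defined under `src/optimizations/` in this commit); the output directory name is a placeholder:
+ ```python
+ from src.optimizations.onnx_conversion import convert_to_onnx
+ from src.optimizations.quantize import quantize_onnx_model
+
+ output_dir = convert_to_onnx("bert-base-uncased", "text_classification", "onnx_output")
+ if output_dir is not None:  # convert_to_onnx returns None when the task or model is unsupported
+     quantize_onnx_model(output_dir, "8-bit")  # rewrites the exported .onnx files in place
+ ```
+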
42
+ ### Hugging Face Hub Integration
43
+ * Automatically pushes optimized models to your Hugging Face Hub repository
44
+ * Tags models with metadata for easy identification (e.g., onnx, quantized, task type)
45
+
46
+ ### Performance Testing
47
+ Compares original and quantized models using metrics such as the following (an example of the returned structure appears after the list):
48
+ * Mean Squared Error (MSE)
49
+ * Spearman Correlation
50
+ * Cosine Similarity
51
+ * Inference Time
52
+ * Model Size
53
+
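+ The handlers collect these into a dictionary shaped roughly like the sketch below; the exact comparison keys depend on the task, and the numbers are illustrative only:
+ ```python
+ {
+     "model_sizes": {"original": 438.0, "quantized": 181.4},        # MB
+     "inference_times": {"original": 0.042, "quantized": 0.029},    # seconds
+     "comparison_metrics": {"mse": 1.2e-4, "spearman_corr": 0.999, "cosine_sim": 0.998},
+ }
+ ```
+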
54
+ ## File Structure
55
+ ```
56
+ AutoQuantNX/
57
+ ├── src/
58
+ │   ├── handlers/
59
+ │   │   ├── audio_models/
60
+ │   │   │   └── whisper_handler.py
61
+ │   │   ├── img_models/
62
+ │   │   │   └── image_classification_handler.py
63
+ │   │   ├── nlp_models/
64
+ │   │   │   ├── causal_lm_handler.py
65
+ │   │   │   ├── embedding_model_handler.py
66
+ │   │   │   ├── masked_lm_handler.py
67
+ │   │   │   ├── multiple_choice_handler.py
68
+ │   │   │   ├── question_answering_handler.py
69
+ │   │   │   ├── seq2seq_lm_handler.py
70
+ │   │   │   ├── sequence_classification_handler.py
71
+ │   │   │   └── token_classification_handler.py
72
+ │   │   ├── __init__.py
73
+ │   │   └── base_handler.py
74
+ │   ├── optimizations/
75
+ │   │   ├── onnx_conversion.py
76
+ │   │   └── quantize.py
77
+ │   └── utilities/
78
+ │       ├── push_to_hub.py
79
+ │       └── resources.py
80
+ ├── README.md
81
+ ├── app.py
82
+ ├── poetry.lock
83
+ ├── pyproject.toml
84
+ └── requirements.txt
85
+ ```
86
+
87
+ ## Prerequisites
88
+
89
+ ### Using requirements.txt (not my preference, at least)
90
+ * Python 3.10 or 3.11 (the `pyproject.toml` pins `>=3.10,<3.12`)
91
+ * Install dependencies:
92
+ ```bash
93
+ pip install -r requirements.txt
94
+ ```
95
+
96
+ ### Using Poetry
97
+ 1. Install Poetry (if not already installed):
98
+
99
+ Linux:
100
+ ```bash
101
+ curl -sSL https://install.python-poetry.org | python3 -
102
+ ```
103
+ Other platforms: follow the official instructions at https://python-poetry.org/docs/#installation.
104
+
105
+ 2. Install dependencies:
106
+ ```bash
107
+ poetry install
108
+ ```
109
+
110
+ 3. Activate the virtual environment:
111
+ ```bash
112
+ poetry shell
113
+ ```
114
+
115
+ ## Usage
116
+
117
+ ### Launch the App
118
+ Run the following command to start the Gradio web application:
119
+ ```bash
120
+ python app.py
121
+ ```
122
+ The app will be accessible at http://localhost:7860 by default.
123
+
124
+ ### Steps to Use the App
125
+ 1. Enter Model Details:
126
+ * Provide the Hugging Face model name
127
+ * Select the task type (e.g., text classification, question answering)
128
+
129
+ 2. Select Optimization Options:
130
+ * Choose quantization type (e.g., 4-bit, 8-bit)
131
+ * Enable ONNX conversion and select quantization options if needed
132
+
133
+ 3. Provide Hugging Face Token:
134
+ * Enter your Hugging Face token for accessing and pushing models to the Hub
135
+
136
+ 4. Start Conversion:
137
+ * Click the "Start Conversion" button to process the model
138
+
139
+ 5. Monitor Progress:
140
+ * View real-time status updates, resource usage, and results directly in the app
141
+
142
+ 6. Push to Hub:
143
+ * Optimized models are automatically pushed to your specified Hugging Face repository
144
+
145
+ ### Example
146
+ For a model like `bert-base-uncased` on the text classification task (a programmatic sketch follows this list):
147
+ 1. Select text_classification as the task
148
+ 2. Enable quantization (e.g., 8-bit)
149
+ 3. Enable ONNX conversion with optimization
150
+ 4. Click "Start Conversion" and monitor progress
151
+
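+ The same run can also be reproduced programmatically through the app's `process_model` function; this is only a sketch, and the token and repository name below are placeholders:
+ ```python
+ from app import process_model
+
+ status, messages, metrics = process_model(
+     model_name="bert-base-uncased",
+     task="text_classification",
+     quantization_type="8-bit",
+     enable_onnx=True,
+     onnx_quantization="8-bit",
+     hf_token="hf_xxx",                # placeholder token with write access
+     repo_name="bert-tc-demo",         # pushed as <username>/bert-tc-demo-optimized
+     test_text="This movie was great and I loved it.",
+ )
+ print(status["status"], metrics.get("model_sizes"))
+ ```
+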
152
+ ## Key Functions
153
+
154
+ ### app.py
155
+ * `process_model`: Main function handling model quantization, ONNX conversion, and Hugging Face Hub integration
156
+ * `update_memory_info`: Monitors and displays system resource usage
157
+
158
+ ### src/optimizations/onnx_conversion.py
159
+ * `convert_to_onnx`: Converts models to ONNX format
160
+
161
+ ### src/optimizations/quantize.py
162
+ * `ModelQuantizer`: Handles quantization of PyTorch models and performance testing
163
+ * `quantize_onnx_model`: Applies quantization to exported ONNX models
164
+
165
+ ### src/utilities/push_to_hub.py
166
+ * `push_to_hub`: Pushes models to the Hugging Face Hub (a combined usage sketch follows this section)
167
+
168
+ ### src/utilities/resources.py
169
+ * `ResourceManager`: Manages temporary files and memory usage
170
+
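+ A rough sketch of calling the two lower-level utilities directly from the repository root; the repository name and token are placeholders:
+ ```python
+ from transformers import AutoModelForSequenceClassification
+ from src.optimizations.quantize import ModelQuantizer
+ from src.utilities.push_to_hub import push_to_hub
+
+ model = ModelQuantizer.quantize_model(
+     AutoModelForSequenceClassification, "bert-base-uncased", "8-bit"
+ )
+ model.save_pretrained("quantized_model")
+
+ repo, message = push_to_hub(
+     local_path="quantized_model",
+     repo_name="bert-8bit-demo",   # created as <username>/bert-8bit-demo
+     hf_token="hf_xxx",            # placeholder
+     tags=["quantized", "text_classification", "8-bit"],
+ )
+ print(message)
+ ```
+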
171
+ ## Notes
172
+ * Ensure you have sufficient system resources for model conversion and quantization
173
+ * Use a Hugging Face Hub token with proper write permissions for pushing models
174
+
175
+ ## Troubleshooting
176
+ * Model Conversion Fails: Ensure the model and task are supported
177
+ * Insufficient Resources: Free up memory or reduce optimization levels
178
+ * ONNX Quantization Errors: Verify that the selected quantization type is supported for the model
179
+
180
+ ## License
181
+ This project is licensed under the MIT License. See the LICENSE file for details.
182
+
183
+ ## Contributions
184
+ Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
185
+
186
+ ## Acknowledgments
187
+ * Hugging Face Transformers
188
+ * Optimum Library
189
+ * Gradio
190
+ * ONNX Runtime
app.py ADDED
@@ -0,0 +1,211 @@
1
+ import gradio as gr
2
+ import logging
3
+ from typing import Tuple, Dict, Any
4
+ from src.utilities.resources import ResourceManager
5
+ from src.utilities.push_to_hub import push_to_hub
6
+ from src.optimizations.onnx_conversion import convert_to_onnx
7
+ from src.optimizations.quantize import quantize_onnx_model
8
+ from src.handlers import get_model_handler, TASK_CONFIGS
9
+ logging.basicConfig(level=logging.INFO)
10
+ logger = logging.getLogger(__name__)
11
+ import json
12
+
13
+ def process_model(
14
+ model_name: str,
15
+ task: str,
16
+ quantization_type: str,
17
+ enable_onnx: bool,
18
+ onnx_quantization: str,
19
+ hf_token: str,
20
+ repo_name: str,
21
+ test_text: str
22
+ ) -> Tuple[Dict[str, Any], str, Dict[str, Any]]:
23
+ metrics = {}  # defined before the try block so every error path can return it
+ try:
24
+ resource_manager = ResourceManager()
25
+ status_updates = []
26
+ status = {
27
+ "status": "Processing",
28
+ "progress": 0,
29
+ "current_step": "Initializing",
30
+ }
31
+
32
+ metrics = {}
33
+
34
+ if not model_name or not hf_token or not repo_name:
35
+ return (
36
+ {"status": "Error", "progress": 0, "current_step": "Validation Failed"},
37
+ "Model name, HuggingFace token, and repository name are required.",
38
+ metrics
39
+ )
40
+
41
+ status["progress"] = 0.2
42
+ status["current_step"] = "Initialization"
43
+ status_updates.append("Initialization complete")
44
+
45
+ quantized_model_path = None
46
+
47
+ if quantization_type != "None":
48
+ status.update({"progress": 0.4, "current_step": "Quantization"})
49
+ status_updates.append(f"Applying {quantization_type} quantization")
50
+
51
+ if not test_text:
52
+ test_text = TASK_CONFIGS[task]["example_text"]
53
+
54
+ try:
55
+ handler = get_model_handler(task, model_name, quantization_type, test_text)
56
+ quantized_model = handler.compare()
57
+ metrics = handler.get_metrics()
58
+ metrics = json.loads(json.dumps(metrics))
59
+
60
+ quantized_model_path = str(resource_manager.temp_dirs["quantized"] / "model")
61
+ quantized_model.save_pretrained(quantized_model_path)
62
+ status_updates.append("Quantization completed successfully")
63
+ except Exception as e:
64
+ logger.error(f"Quantization error: {str(e)}", exc_info=True)
65
+ return (
66
+ {"status": "Error", "progress": 0.4, "current_step": "Quantization Failed"},
67
+ f"Quantization failed: {str(e)}",
68
+ metrics
69
+ )
70
+
71
+ if enable_onnx:
72
+ status.update({"progress": 0.6, "current_step": "ONNX Conversion"})
73
+ status_updates.append("Converting to ONNX format")
74
+
75
+ try:
76
+ output_dir = str(resource_manager.temp_dirs["onnx"])
77
+ onnx_result = convert_to_onnx(model_name, task, output_dir)
78
+
79
+ if onnx_result is None:
80
+ return (
81
+ {"status": "Error", "progress": 0.6, "current_step": "ONNX Conversion Failed"},
82
+ "ONNX conversion failed.",
83
+ metrics
84
+ )
85
+
86
+ if onnx_quantization != "None":
87
+ status_updates.append(f"Applying {onnx_quantization} ONNX quantization")
88
+ quantize_onnx_model(output_dir, onnx_quantization)
89
+
90
+ status.update({"progress": 0.8, "current_step": "Pushing ONNX Model"})
91
+ status_updates.append("Pushing ONNX model to Hub")
92
+ result, push_message = push_to_hub(
93
+ local_path=output_dir,
94
+ repo_name=f"{repo_name}-optimized",
95
+ hf_token=hf_token,
96
+ tags=["onnx", "optimum", task],
97
+ )
98
+ status_updates.append(push_message)
99
+ except Exception as e:
100
+ logger.error(f"ONNX error: {str(e)}", exc_info=True)
101
+ return (
102
+ {"status": "Error", "progress": 0.6, "current_step": "ONNX Processing Failed"},
103
+ f"ONNX processing failed: {str(e)}",
104
+ metrics
105
+ )
106
+
107
+ if quantization_type != "None" and quantized_model_path:
108
+ status.update({"progress": 0.9, "current_step": "Pushing Quantized Model"})
109
+ status_updates.append("Pushing quantized model to Hub")
110
+ result, push_message = push_to_hub(
111
+ local_path=quantized_model_path,
112
+ repo_name=f"{repo_name}-optimized",
113
+ hf_token=hf_token,
114
+ tags=["quantized", task, quantization_type],
115
+ )
116
+ status_updates.append(push_message)
117
+
118
+ status.update({"progress": 1.0, "status": "Complete", "current_step": "Completed"})
119
+ cleanup_message = resource_manager.cleanup_temp_files()
120
+ status_updates.append(cleanup_message)
121
+
122
+ return (
123
+ status,
124
+ "\n".join(status_updates),
125
+ metrics
126
+ )
127
+
128
+ except Exception as e:
129
+ logger.error(f"Error during processing: {str(e)}", exc_info=True)
130
+ return (
131
+ {"status": "Error", "progress": 0, "current_step": "Process Failed"},
132
+ f"An error occurred: {str(e)}",
133
+ metrics
134
+ )
135
+
136
+ # Gradio Interface
137
+ with gr.Blocks(theme=gr.themes.Soft()) as app:
138
+ gr.Markdown("""
139
+ # 🤗 Model Conversion Hub
140
+ Convert and optimize your Hugging Face models with quantization and ONNX support.
141
+ """)
142
+
143
+ with gr.Row():
144
+ with gr.Column(scale=2):
145
+ model_name = gr.Textbox(label="Model Name", placeholder="e.g., bert-base-uncased")
146
+ task = gr.Dropdown(choices=list(TASK_CONFIGS.keys()), label="Task", value="text_classification")
147
+
148
+ with gr.Group():
149
+ gr.Markdown("### Quantization Settings")
150
+ quantization_type = gr.Dropdown(choices=["None", "4-bit", "8-bit", "16-bit-float"], label="Quantization Type", value="None")
151
+ test_text = gr.Textbox(label="Test Text", placeholder="Enter text for model evaluation", lines=3, visible=False)
152
+
153
+ with gr.Group():
154
+ gr.Markdown("### ONNX Settings")
155
+ enable_onnx = gr.Checkbox(label="Enable ONNX Conversion")
156
+ with gr.Group(visible=False) as onnx_group:
157
+ onnx_quantization = gr.Dropdown(choices=["None", "8-bit", "16-bit-int", "16-bit-float"], label="ONNX Quantization", value="None")
158
+
159
+ with gr.Group():
160
+ gr.Markdown("### HuggingFace Settings")
161
+ hf_token = gr.Textbox(label="HuggingFace Token (Required)", type="password")
162
+ repo_name = gr.Textbox(label="Repository Name")
163
+
164
+ with gr.Column(scale=1):
165
+ status_output = gr.JSON(label="Status", value={"status": "Ready", "progress": 0, "current_step": "Waiting"})
166
+ message_output = gr.Markdown(label="Progress Messages")
167
+
168
+ gr.Markdown("### Metrics")
169
+ with gr.Group():
170
+ metrics_output = gr.JSON(
171
+ value={
172
+ "model_sizes": {"original": 0.0, "quantized": 0.0},
173
+ "inference_times": {"original": 0.0, "quantized": 0.0},
174
+ "comparison_metrics": {}
175
+ },
176
+ show_label=True
177
+ )
178
+
179
+ memory_info = gr.JSON(label="Resource Usage")
180
+ convert_btn = gr.Button("🚀 Start Conversion", variant="primary")
181
+
182
+ with gr.Accordion("ℹ️ Help", open=False):
183
+ gr.Markdown("""
184
+ ### Quick Guide
185
+ 1. Enter your model name and HuggingFace token.
186
+ 2. Select the appropriate task.
187
+ 3. Choose optimization options.
188
+ 4. Click Start Conversion.
189
+
190
+ ### Tips
191
+ - Ensure sufficient system resources.
192
+ - Use test text to validate conversions.
193
+ """)
194
+
195
+ def update_memory_info():
196
+ resource_manager = ResourceManager()
197
+ return resource_manager.get_memory_info()
198
+
199
+ quantization_type.change(lambda x: gr.update(visible=x != "None"), inputs=[quantization_type], outputs=[test_text])
200
+ task.change(lambda x: gr.update(value=TASK_CONFIGS[x]["example_text"]), inputs=[task], outputs=[test_text])
201
+ enable_onnx.change(lambda x: gr.update(visible=x), inputs=[enable_onnx], outputs=[onnx_group])
202
+
203
+ convert_btn.click(
204
+ process_model,
205
+ inputs=[model_name, task, quantization_type, enable_onnx, onnx_quantization, hf_token, repo_name, test_text],
206
+ outputs=[status_output, message_output, metrics_output]
207
+ )
208
+ app.load(update_memory_info, outputs=[memory_info], every=30)
209
+
210
+ if __name__ == "__main__":
211
+ app.launch(server_name="0.0.0.0", server_port=7860, share=True, debug=True)
poetry.lock ADDED
The diff for this file is too large to render. See raw diff
 
pyproject.toml ADDED
@@ -0,0 +1,53 @@
1
+ [tool.poetry]
2
+ name = "autoquantnx"
3
+ version = "0.1.0"
4
+ description = "Web app to quantize Hugging Face models, convert them to ONNX in one go, and compare the results"
5
+ authors = ["kartikbhtt7 <[email protected]>"]
6
+ readme = "README.md"
7
+
8
+ [tool.poetry.dependencies]
9
+ python = ">=3.10,<3.12"
10
+ gradio = "^4.0.0"
11
+ transformers = "^4.31.0"
12
+ pandas = "^2.2.2"
13
+ torch = "^2.3.0"
14
+ onnx = "^1.16.2"
15
+ onnxruntime = "^1.18.1"
16
+ onnxconverter-common = ">=1.14.0"
17
+ optimum = "^1.21.3"
18
+ huggingface-hub = "^0.24.6"
19
+ sentence-transformers = "^3.0.1"
20
+ bitsandbytes = "^0.43.3"
21
+ evaluate = "^0.4.0"
22
+ faiss-gpu = "^1.7.2"
23
+ faiss-cpu = "^1.8.0.post1"
24
+ azure-cognitiveservices-speech = "^1.40.0"
25
+ gdown = "^5.2.0"
26
+ jiwer = "^3.0.4"
27
+ pydub = "^0.25.1"
28
+ librosa = "^0.10.2.post1"
29
+ soundfile = "^0.12.1"
30
+ catalogue = "^2.0.10"
31
+ langchain-core = "^0.1.40"
32
+ langchain-openai = "^0.1.0"
33
+ fast-pytorch-kmeans = "^0.2.0.1"
34
+ typing-extensions = "^4.12.2"
35
+ textwrap3 = "^0.9.2"
36
+ pynvml = "^11.5.3"
37
+ psutil = "^6.1.1"
38
+ accelerate = "^0.26.0"
39
+
40
+ [tool.poetry.dev-dependencies]
41
+ black = "^23.7.0"
42
+ flake8 = "^6.1.0"
43
+ pytest = "^7.4.3"
44
+ pytest-asyncio = "^0.21.1"
45
+ pytest-django = "^4.8.0"
46
+ pytest-cov = "^4.1.0"
47
+ pytest-testmon = "^2.1.0"
48
+ pytest-watch = "^4.2.0"
49
+ coverage = "^7.3.2"
50
+
51
+ [build-system]
52
+ requires = ["poetry-core"]
53
+ build-backend = "poetry.core.masonry.api"
requirements.txt ADDED
@@ -0,0 +1,167 @@
1
+ accelerate==0.26.1
2
+ aiofiles==23.2.1
3
+ aiohappyeyeballs==2.4.4
4
+ aiohttp==3.11.11
5
+ aiosignal==1.3.2
6
+ annotated-types==0.7.0
7
+ anyio==4.7.0
8
+ async-timeout==5.0.1
9
+ attrs==24.3.0
10
+ audioread==3.0.1
11
+ azure-cognitiveservices-speech==1.41.1
12
+ beautifulsoup4==4.12.3
13
+ bitsandbytes==0.43.3
14
+ black==23.12.1
15
+ catalogue==2.0.10
16
+ certifi==2024.12.14
17
+ cffi==1.17.1
18
+ charset-normalizer==3.4.1
19
+ click==8.1.8
20
+ colorama==0.4.6
21
+ coloredlogs==15.0.1
22
+ contourpy==1.3.1
23
+ coverage==7.6.10
24
+ cycler==0.12.1
25
+ datasets==2.14.4
26
+ decorator==5.1.1
27
+ dill==0.3.7
28
+ distro==1.9.0
29
+ docopt==0.6.2
30
+ evaluate==0.4.3
31
+ exceptiongroup==1.2.2
32
+ faiss-cpu==1.9.0.post1
33
+ faiss-gpu==1.7.2
34
+ fast_pytorch_kmeans==0.2.2
35
+ fastapi==0.115.6
36
+ ffmpy==0.5.0
37
+ filelock==3.16.1
38
+ flake8==6.1.0
39
+ flatbuffers==24.12.23
40
+ fonttools==4.55.3
41
+ frozenlist==1.5.0
42
+ fsspec==2024.12.0
43
+ gdown==5.2.0
44
+ gradio==4.44.1
45
+ gradio_client==1.3.0
46
+ h11==0.14.0
47
+ httpcore==1.0.7
48
+ httpx==0.28.1
49
+ huggingface-hub==0.24.7
50
+ humanfriendly==10.0
51
+ idna==3.10
52
+ importlib_resources==6.4.5
53
+ iniconfig==2.0.0
54
+ Jinja2==3.1.5
55
+ jiter==0.8.2
56
+ jiwer==3.0.5
57
+ joblib==1.4.2
58
+ jsonpatch==1.33
59
+ jsonpointer==3.0.0
60
+ kiwisolver==1.4.8
61
+ langchain-core==0.1.53
62
+ langchain-openai==0.1.7
63
+ langsmith==0.1.147
64
+ lazy_loader==0.4
65
+ librosa==0.10.2.post1
66
+ llvmlite==0.43.0
67
+ markdown-it-py==3.0.0
68
+ MarkupSafe==2.1.5
69
+ matplotlib==3.10.0
70
+ mccabe==0.7.0
71
+ mdurl==0.1.2
72
+ mpmath==1.3.0
73
+ msgpack==1.1.0
74
+ multidict==6.1.0
75
+ multiprocess==0.70.15
76
+ mypy-extensions==1.0.0
77
+ networkx==3.4.2
78
+ numba==0.60.0
79
+ numpy==1.26.4
80
+ nvidia-cublas-cu12==12.1.3.1
81
+ nvidia-cuda-cupti-cu12==12.1.105
82
+ nvidia-cuda-nvrtc-cu12==12.1.105
83
+ nvidia-cuda-runtime-cu12==12.1.105
84
+ nvidia-cudnn-cu12==9.1.0.70
85
+ nvidia-cufft-cu12==11.0.2.54
86
+ nvidia-curand-cu12==10.3.2.106
87
+ nvidia-cusolver-cu12==11.4.5.107
88
+ nvidia-cusparse-cu12==12.1.0.106
89
+ nvidia-nccl-cu12==2.20.5
90
+ nvidia-nvjitlink-cu12==12.6.85
91
+ nvidia-nvtx-cu12==12.1.105
92
+ onnx==1.17.0
93
+ onnxconverter-common==1.14.0
94
+ onnxruntime==1.20.1
95
+ openai==1.58.1
96
+ optimum==1.23.3
97
+ orjson==3.10.13
98
+ packaging==23.2
99
+ pandas==2.2.3
100
+ pathspec==0.12.1
101
+ pillow==10.4.0
102
+ platformdirs==4.3.6
103
+ pluggy==1.5.0
104
+ pooch==1.8.2
105
+ propcache==0.2.1
106
+ protobuf==3.20.2
107
+ psutil==6.1.1
108
+ pyarrow==18.1.0
109
+ pycodestyle==2.11.1
110
+ pycparser==2.22
111
+ pydantic==2.10.4
112
+ pydantic_core==2.27.2
113
+ pydub==0.25.1
114
+ pyflakes==3.1.0
115
+ Pygments==2.19.1
116
+ pynvml==11.5.3
117
+ pyparsing==3.2.1
118
+ PySocks==1.7.1
119
+ pytest==7.4.4
120
+ pytest-asyncio==0.21.2
121
+ pytest-cov==4.1.0
122
+ pytest-django==4.9.0
123
+ pytest-testmon==2.1.3
124
+ pytest-watch==4.2.0
125
+ python-dateutil==2.9.0.post0
126
+ python-multipart==0.0.20
127
+ pytz==2024.2
128
+ PyYAML==6.0.2
129
+ RapidFuzz==3.11.0
130
+ regex==2024.11.6
131
+ requests==2.32.3
132
+ requests-toolbelt==1.0.0
133
+ rich==13.9.4
134
+ ruff==0.9.6
135
+ safetensors==0.4.5
136
+ scikit-learn==1.6.0
137
+ scipy==1.14.1
138
+ semantic-version==2.10.0
139
+ sentence-transformers==3.3.1
140
+ shellingham==1.5.4
141
+ six==1.17.0
142
+ sniffio==1.3.1
143
+ soundfile==0.12.1
144
+ soupsieve==2.6
145
+ soxr==0.5.0.post1
146
+ starlette==0.41.3
147
+ sympy==1.13.3
148
+ tenacity==8.5.0
149
+ textwrap3==0.9.2
150
+ threadpoolctl==3.5.0
151
+ tiktoken==0.8.0
152
+ tokenizers==0.21.0
153
+ tomli==2.2.1
154
+ tomlkit==0.12.0
155
+ torch==2.4.1
156
+ tqdm==4.67.1
157
+ transformers==4.47.1
158
+ triton==3.0.0
159
+ typer==0.15.1
160
+ typing_extensions==4.12.2
161
+ tzdata==2024.2
162
+ urllib3==2.3.0
163
+ uvicorn==0.34.0
164
+ watchdog==6.0.0
165
+ websockets==11.0.3
166
+ xxhash==3.5.0
167
+ yarl==1.18.3
src/handlers/__init__.py ADDED
@@ -0,0 +1,84 @@
1
+ from .base_handler import ModelHandler
2
+ from .nlp_models.sequence_classification_handler import SequenceClassificationHandler
3
+ from .nlp_models.question_answering_handler import QuestionAnsweringHandler
4
+ from .nlp_models.token_classification_handler import TokenClassificationHandler
5
+ from .nlp_models.causal_lm_handler import CausalLMHandler
6
+ from .nlp_models.embedding_model_handler import EmbeddingModelHandler
7
+ from .audio_models.whisper_handler import WhisperHandler
8
+ from .nlp_models.masked_lm_handler import MaskedLMHandler
9
+ from .nlp_models.seq2seq_lm_handler import Seq2SeqLMHandler
10
+ from .nlp_models.multiple_choice_handler import MultipleChoiceHandler
11
+ from .img_models.image_classification_handler import ImageClassificationHandler
12
+
13
+ from transformers import (
14
+ AutoModel,
15
+ AutoModelForTokenClassification,
16
+ AutoModelForSequenceClassification,
17
+ AutoModelForQuestionAnswering,
18
+ AutoModelForCausalLM,
19
+ AutoModelForMaskedLM,
20
+ AutoModelForSeq2SeqLM,
21
+ AutoModelForMultipleChoice,
22
+ )
23
+
24
+ TASK_CONFIGS = {
25
+ "embedding": {
26
+ "model_class": AutoModel,
27
+ "handler_class": EmbeddingModelHandler,
28
+ "example_text": "Hey, I am feeling way to good to be true.",
29
+ },
30
+ "ner": {
31
+ "model_class": AutoModelForTokenClassification,
32
+ "handler_class": TokenClassificationHandler,
33
+ "example_text": "John works at Google in New York as a software engineer.",
34
+ },
35
+ "text_classification": {
36
+ "model_class": AutoModelForSequenceClassification,
37
+ "handler_class": SequenceClassificationHandler,
38
+ "example_text": "This movie was great and I loved it.",
39
+ },
40
+ "question_answering": {
41
+ "model_class": AutoModelForQuestionAnswering,
42
+ "handler_class": QuestionAnsweringHandler,
43
+ "example_text": "The pyramids were built in ancient Egypt. QUES: Where were the pyramids built?",
44
+ },
45
+ "causal_lm": {
46
+ "model_class": AutoModelForCausalLM,
47
+ "handler_class": CausalLMHandler,
48
+ "example_text": "Once upon a time, there was ",
49
+ },
50
+ "mask_lm": {
51
+ "model_class": AutoModelForMaskedLM,
52
+ "handler_class": MaskedLMHandler,
53
+ "example_text": "The quick brown [MASK] jumps over the lazy dog.",
54
+ },
55
+ "seq2seq_lm": {
56
+ "model_class": AutoModelForSeq2SeqLM,
57
+ "handler_class": Seq2SeqLMHandler,
58
+ "example_text": "Translate English to French: The house is wonderful.",
59
+ },
60
+ "multiple_choice": {
61
+ "model_class": AutoModelForMultipleChoice,
62
+ "handler_class": MultipleChoiceHandler,
63
+ "example_text": "What is the capital of France? (A) Paris (B) London (C) Berlin (D) Rome",
64
+ },
65
+ "whisper_finetuning": {
66
+ "model_class": None, # Not implemented
67
+ "handler_class": WhisperHandler,
68
+ "example_text": "!!!!!NOT IMPLEMENTED!!!!!",
69
+ },
70
+ "image_classification": {
71
+ "model_class": None, # Not implemented
72
+ "handler_class": ImageClassificationHandler,
73
+ "example_text": "!!!!!NOT IMPLEMENTED!!!!!",
74
+ },
75
+ }
76
+
77
+ def get_model_handler(task: str, model_name: str, quantization_type: str, test_text: str):
78
+ task_config = TASK_CONFIGS.get(task)
79
+ if not task_config:
80
+ raise ValueError(f"No configuration found for task: {task}")
81
+
82
+ handler_class = task_config["handler_class"]
83
+ model_class = task_config["model_class"]
84
+ return handler_class(model_name, model_class, quantization_type, test_text)
src/handlers/audio_models/whisper_handler.py ADDED
@@ -0,0 +1,14 @@
1
+ from ..base_handler import ModelHandler
2
+
3
+ class WhisperHandler(ModelHandler):
4
+ def __init__(self, model_name, model_class, quantization_type, test_text):
5
+ super().__init__(model_name, model_class, quantization_type, test_text)
6
+
7
+ def run_inference(self, model, text):
8
+ raise NotImplementedError("STT is not implemented.")
9
+
10
+ def decode_output(self, outputs):
11
+ raise NotImplementedError("STT is not implemented.")
12
+
13
+ def compare_outputs(self, original_outputs, quantized_outputs):
14
+ raise NotImplementedError("STT is not implemented.")
src/handlers/base_handler.py ADDED
@@ -0,0 +1,148 @@
1
+ from src.optimizations.quantize import ModelQuantizer  # absolute import so the handlers resolve when app.py is run from the repo root
2
+ import torch
3
+ import logging
4
+ import numpy as np
5
+ from dataclasses import dataclass
6
+ from typing import Dict, Any, Optional
7
+ import json
8
+ logger = logging.getLogger(__name__)
9
+
10
+ @dataclass
11
+ class ModelMetrics:
12
+ model_sizes: Dict[str, float]
13
+ inference_times: Dict[str, float]
14
+ comparison_metrics: Dict[str, Any]
15
+
16
+ class ModelHandler:
17
+ """Base class for handling different types of models"""
18
+
19
+ def __init__(self, model_name, model_class, quantization_type, test_text=None):
20
+ self.model_name = model_name
21
+ self.model_class = model_class
22
+ self.quantization_type = quantization_type
23
+ self.test_text = test_text
24
+ self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
25
+
26
+ # Load models
27
+ self.original_model = self._load_original_model()
28
+ self.quantized_model = self._load_quantized_model()
29
+ self.metrics: Optional[ModelMetrics] = None
30
+
31
+ def _load_original_model(self):
32
+ """Load the original model"""
33
+ model = self.model_class.from_pretrained(self.model_name)
34
+ return model.to(self.device)
35
+
36
+ def _load_quantized_model(self):
37
+ """Load the quantized model using ModelQuantizer"""
38
+ model = ModelQuantizer.quantize_model(
39
+ self.model_class,
40
+ self.model_name,
41
+ self.quantization_type
42
+ )
43
+ if self.quantization_type not in ["4-bit", "8-bit"]:
44
+ model = model.to(self.device)
45
+ return model
46
+
47
+ @staticmethod
48
+ def _convert_to_serializable(obj):
49
+ """Serialization for metrics"""
50
+ if isinstance(obj, np.generic):
51
+ return obj.item()
52
+ if isinstance(obj, (np.float32, np.float64)):
53
+ return float(obj)
54
+ if isinstance(obj, (np.int32, np.int64)):
55
+ return int(obj)
56
+ if isinstance(obj, np.ndarray):
57
+ return obj.tolist()
58
+ if isinstance(obj, torch.Tensor):
59
+ return obj.cpu().numpy().tolist()
60
+ if isinstance(obj, dict):
61
+ return {k: ModelHandler._convert_to_serializable(v) for k, v in obj.items()}
62
+ if isinstance(obj, list):
63
+ return [ModelHandler._convert_to_serializable(v) for v in obj]
64
+ return obj
65
+
66
+ def _format_metric_value(self, value):
67
+ """Format metric value based on its type"""
68
+ if isinstance(value, (float, np.float32, np.float64)):
69
+ return f"{value:.8f}"
70
+ elif isinstance(value, (int, np.int32, np.int64)):
71
+ return str(value)
72
+ elif isinstance(value, list):
73
+ return "\n" + "\n".join([f" - {item}" for item in value])
74
+ elif isinstance(value, dict):
75
+ return "\n" + "\n".join([f" {k}: {v}" for k, v in value.items()])
76
+ else:
77
+ return str(value)
78
+
79
+ def run_inference(self, model, text):
80
+ """Run model inference - to be implemented by subclasses"""
81
+ raise NotImplementedError
82
+
83
+ def decode_output(self, outputs):
84
+ """Decode model outputs - to be implemented by subclasses"""
85
+ raise NotImplementedError
86
+
87
+ def compare(self):
88
+ """Compare original and quantized models"""
89
+ try:
90
+ if self.test_text is None:
91
+ logger.warning("No test text provided. Skipping inference testing.")
92
+ return self.quantized_model
93
+
94
+ # Run inference
95
+ original_outputs, original_time = self.run_inference(self.original_model, self.test_text)
96
+ quantized_outputs, quantized_time = self.run_inference(self.quantized_model, self.test_text)
97
+
98
+ original_size = ModelQuantizer.get_model_size(self.original_model)
99
+ quantized_size = ModelQuantizer.get_model_size(self.quantized_model)
100
+
101
+ logger.info(f"Original Model Size: {original_size:.2f} MB")
102
+ logger.info(f"Quantized Model Size: {quantized_size:.2f} MB")
103
+ logger.info(f"Original Inference Time: {original_time:.4f} seconds")
104
+ logger.info(f"Quantized Inference Time: {quantized_time:.4f} seconds")
105
+
106
+ # Compare outputs
107
+ comparison_metrics = self.compare_outputs(original_outputs, quantized_outputs) or {}
108
+
109
+ for key, value in comparison_metrics.items():
110
+ comparison_metrics[key] = self._convert_to_serializable(value)
111
+
112
+ self.metrics = {
113
+ "model_sizes": {
114
+ "original": float(original_size),
115
+ "quantized": float(quantized_size)
116
+ },
117
+ "inference_times": {
118
+ "original": float(original_time),
119
+ "quantized": float(quantized_time)
120
+ },
121
+ "comparison_metrics": comparison_metrics
122
+ }
123
+
124
+ return self.quantized_model
125
+ except Exception as e:
126
+ logger.error(f"Quantization and comparison failed: {str(e)}")
127
+ raise e
128
+
129
+ def get_metrics(self) -> Dict[str, Any]:
130
+ """Return the metrics dictionary"""
131
+ if self.metrics is None:
132
+ return {
133
+ "model_sizes": {"original": 0.0, "quantized": 0.0},
134
+ "inference_times": {"original": 0.0, "quantized": 0.0},
135
+ "comparison_metrics": {}
136
+ }
137
+ serializable_metrics = self._convert_to_serializable(self.metrics)
138
+ try:
139
+ json.dumps(serializable_metrics)
140
+ return serializable_metrics
141
+ except (TypeError, ValueError) as e:
142
+ logger.error(f"Error serializing metrics: {str(e)}")
143
+ return {
144
+ "model_sizes": {"original": 0.0, "quantized": 0.0},
145
+ "inference_times": {"original": 0.0, "quantized": 0.0},
146
+ "comparison_metrics": {}
147
+ }
148
+
src/handlers/img_models/image_classification_handler.py ADDED
@@ -0,0 +1,14 @@
1
+ from ..base_handler import ModelHandler
2
+
3
+ class ImageClassificationHandler(ModelHandler):
4
+ def __init__(self, model_name, model_class, quantization_type, test_text):
5
+ super().__init__(model_name, model_class, quantization_type, test_text)
6
+
7
+ def run_inference(self, model, text):
8
+ raise NotImplementedError("Image classification is not implemented.")
9
+
10
+ def decode_output(self, outputs):
11
+ raise NotImplementedError("Image classification is not implemented.")
12
+
13
+ def compare_outputs(self, original_outputs, quantized_outputs):
14
+ raise NotImplementedError("Image classification is not implemented.")
src/handlers/nlp_models/causal_lm_handler.py ADDED
@@ -0,0 +1,46 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+ from scipy.stats import spearmanr
6
+ import numpy as np
7
+
8
+ class CausalLMHandler(ModelHandler):
9
+ def __init__(self, model_name, model_class, quantization_type, test_text):
10
+ super().__init__(model_name, model_class, quantization_type, test_text)
11
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
12
+
13
+ def run_inference(self, model, text):
14
+ inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
15
+ start_time = time.time()
16
+ with torch.no_grad():
17
+ outputs = model.generate(**inputs, max_length=50)
18
+ end_time = time.time()
19
+ return outputs, end_time - start_time
20
+
21
+ def decode_output(self, outputs):
22
+ return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
23
+
24
+ def compare_outputs(self, original_outputs, quantized_outputs):
25
+ """Compare outputs for causal language models"""
26
+ if original_outputs is None or quantized_outputs is None:
27
+ return None
28
+
29
+ original_tokens = original_outputs[0].cpu().numpy()
30
+ quantized_tokens = quantized_outputs[0].cpu().numpy()
31
+
32
+ metrics = {
33
+ 'sequence_similarity': np.mean(original_tokens == quantized_tokens),
34
+ 'sequence_length_diff': abs(len(original_tokens) - len(quantized_tokens)),
35
+ # minlength pads both count vectors to a common size so spearmanr gets equal-length inputs
36
+ 'vocab_distribution_correlation': spearmanr(
37
+ np.bincount(original_tokens, minlength=int(max(original_tokens.max(), quantized_tokens.max())) + 1),
38
+ np.bincount(quantized_tokens, minlength=int(max(original_tokens.max(), quantized_tokens.max())) + 1)
+ )[0] if len(original_tokens) == len(quantized_tokens) else 0.0
39
+ }
40
+
41
+ original_text = self.decode_output(original_outputs)
42
+ quantized_text = self.decode_output(quantized_outputs)
43
+ metrics['decoded_text_match'] = float(original_text == quantized_text)
44
+ metrics['original_model_text'] = original_text
45
+ metrics['quantized_model_text'] = quantized_text
46
+ return metrics
src/handlers/nlp_models/embedding_model_handler.py ADDED
@@ -0,0 +1,49 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+ import numpy as np
6
+ from scipy.stats import spearmanr
7
+ from sklearn.metrics.pairwise import cosine_similarity
8
+
9
+ class EmbeddingModelHandler(ModelHandler):
10
+ def __init__(self, model_name, model_class, quantization_type, test_text):
11
+ super().__init__(model_name, model_class, quantization_type, test_text)
12
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
13
+
14
+ def run_inference(self, model, text):
15
+ inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(self.device)
16
+ start_time = time.time()
17
+ with torch.no_grad():
18
+ outputs = model(**inputs)
19
+ end_time = time.time()
20
+ return outputs, end_time - start_time
21
+
22
+ def decode_output(self, outputs):
23
+ return outputs.last_hidden_state.mean(dim=1).cpu().numpy()
24
+
25
+ def compare_outputs(self, original_outputs, quantized_outputs):
26
+ """Compare outputs for embedding models"""
27
+ if original_outputs is None or quantized_outputs is None:
28
+ return None
29
+
30
+ original_embeds = original_outputs.last_hidden_state.cpu().numpy()
31
+ quantized_embeds = quantized_outputs.last_hidden_state.cpu().numpy()
32
+
33
+ metrics = {
34
+ 'mse': ((original_embeds - quantized_embeds) ** 2).mean(),
35
+ 'cosine_similarity': cosine_similarity(
36
+ original_embeds.reshape(1, -1),
37
+ quantized_embeds.reshape(1, -1)
38
+ )[0][0],
39
+ 'correlation': spearmanr(
40
+ original_embeds.flatten(),
41
+ quantized_embeds.flatten()
42
+ )[0],
43
+ 'norm_difference': np.abs(
44
+ np.linalg.norm(original_embeds) -
45
+ np.linalg.norm(quantized_embeds)
46
+ )
47
+ }
48
+
49
+ return metrics
src/handlers/nlp_models/masked_lm_handler.py ADDED
@@ -0,0 +1,39 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+ import numpy as np
6
+
7
+ class MaskedLMHandler(ModelHandler):
8
+ def __init__(self, model_name, model_class, quantization_type, test_text):
9
+ super().__init__(model_name, model_class, quantization_type, test_text)
10
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
11
+
12
+ def run_inference(self, model, text):
13
+ inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
14
+ start_time = time.time()
15
+ with torch.no_grad():
16
+ outputs = model(**inputs)
17
+ end_time = time.time()
18
+ self.last_inputs = inputs  # kept for decode_output; the base class expects (outputs, time)
+ return outputs, end_time - start_time
19
+
20
+ def decode_output(self, outputs, inputs=None):
+ inputs = inputs if inputs is not None else self.last_inputs
21
+ logits = outputs.logits
22
+ masked_index = torch.where(inputs['input_ids'] == self.tokenizer.mask_token_id)[1]
23
+ predicted_token_id = logits[0, masked_index].argmax(axis=-1)
24
+ return self.tokenizer.decode(predicted_token_id)
25
+
26
+ def compare_outputs(self, original_outputs, quantized_outputs):
27
+ if original_outputs is None or quantized_outputs is None:
28
+ return None
29
+
30
+ original_logits = original_outputs.logits.detach().cpu().numpy()
31
+ quantized_logits = quantized_outputs.logits.detach().cpu().numpy()
32
+
33
+ metrics = {
34
+ 'mse': ((original_logits - quantized_logits) ** 2).mean(),
35
+ 'top_1_accuracy': np.mean(
36
+ np.argmax(original_logits, axis=-1) == np.argmax(quantized_logits, axis=-1)
37
+ ),
38
+ }
39
+ return metrics
src/handlers/nlp_models/multiple_choice_handler.py ADDED
@@ -0,0 +1,39 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+ import numpy as np
6
+
7
+ class MultipleChoiceHandler(ModelHandler):
8
+ def __init__(self, model_name, model_class, quantization_type, test_text):
9
+ super().__init__(model_name, model_class, quantization_type, test_text)
10
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
11
+
12
+ def run_inference(self, model, text):
13
+ choices = [text.split(f"({chr(65 + i)})")[1].strip() for i in range(4)]
14
+ inputs = self.tokenizer(choices, return_tensors='pt', padding=True).to(self.device)
15
+ start_time = time.time()
16
+ with torch.no_grad():
17
+ outputs = model(**inputs)
18
+ end_time = time.time()
19
+ return outputs, end_time - start_time
20
+
21
+ def decode_output(self, outputs):
22
+ logits = outputs.logits
23
+ predicted_choice = chr(65 + logits.argmax().item())
24
+ return f"Predicted choice: {predicted_choice}"
25
+
26
+ def compare_outputs(self, original_outputs, quantized_outputs):
27
+ if original_outputs is None or quantized_outputs is None:
28
+ return None
29
+
30
+ original_logits = original_outputs.logits.detach().cpu().numpy()
31
+ quantized_logits = quantized_outputs.logits.detach().cpu().numpy()
32
+
33
+ metrics = {
34
+ 'mse': ((original_logits - quantized_logits) ** 2).mean(),
35
+ 'top_1_accuracy': np.mean(
36
+ np.argmax(original_logits, axis=-1) == np.argmax(quantized_logits, axis=-1)
37
+ ),
38
+ }
39
+ return metrics
src/handlers/nlp_models/question_answering_handler.py ADDED
@@ -0,0 +1,59 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+
6
+ class QuestionAnsweringHandler(ModelHandler):
7
+ def __init__(self, model_name, model_class, quantization_type, test_text):
8
+ super().__init__(model_name, model_class, quantization_type, test_text)
9
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
10
+
11
+ def run_inference(self, model, text):
12
+ parts = text.split('QUES')
13
+ context = parts[0].strip()
14
+ question = parts[1].strip()
15
+ inputs = self.tokenizer(question, context, return_tensors='pt', truncation=True, padding=True).to(self.device)
16
+ start_time = time.time()
17
+ with torch.no_grad():
18
+ outputs = model(**inputs)
19
+ end_time = time.time()
20
+ return outputs, end_time - start_time
21
+
22
+ def decode_output(self, outputs):
23
+ start_logits = outputs.start_logits
24
+ end_logits = outputs.end_logits
25
+ answer_start = torch.argmax(start_logits)
26
+ answer_end = torch.argmax(end_logits) + 1
27
+ input_ids = self.tokenizer.encode(self.test_text)
28
+ answer = self.tokenizer.decode(input_ids[answer_start:answer_end])
29
+ return f"Answer: {answer}"
30
+
31
+ def compare_outputs(self, original_outputs, quantized_outputs):
32
+ """Compare outputs for question answering models"""
33
+ if original_outputs is None or quantized_outputs is None:
34
+ return None
35
+
36
+ orig_start = original_outputs.start_logits.cpu().numpy()
37
+ orig_end = original_outputs.end_logits.cpu().numpy()
38
+ quant_start = quantized_outputs.start_logits.cpu().numpy()
39
+ quant_end = quantized_outputs.end_logits.cpu().numpy()
40
+
41
+ orig_start_pos = orig_start.argmax()
42
+ orig_end_pos = orig_end.argmax()
43
+ quant_start_pos = quant_start.argmax()
44
+ quant_end_pos = quant_end.argmax()
45
+
46
+ input_ids = self.tokenizer.encode(self.test_text)
47
+ original_answer = self.tokenizer.decode(input_ids[orig_start_pos:orig_end_pos + 1])
48
+ quantized_answer = self.tokenizer.decode(input_ids[quant_start_pos:quant_end_pos + 1])
49
+
50
+ metrics = {
51
+ 'original_answer': original_answer,
52
+ 'quantized_answer': quantized_answer,
53
+ 'start_position_match': float(orig_start_pos == quant_start_pos),
54
+ 'end_position_match': float(orig_end_pos == quant_end_pos),
55
+ 'start_logits_mse': ((orig_start - quant_start) ** 2).mean(),
56
+ 'end_logits_mse': ((orig_end - quant_end) ** 2).mean(),
57
+ }
58
+
59
+ return metrics
src/handlers/nlp_models/seq2seq_lm_handler.py ADDED
@@ -0,0 +1,34 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+ import numpy as np
6
+
7
+ class Seq2SeqLMHandler(ModelHandler):
8
+ def __init__(self, model_name, model_class, quantization_type, test_text):
9
+ super().__init__(model_name, model_class, quantization_type, test_text)
10
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
11
+
12
+ def run_inference(self, model, text):
13
+ inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
14
+ start_time = time.time()
15
+ with torch.no_grad():
16
+ outputs = model.generate(**inputs, max_length=50)
17
+ end_time = time.time()
18
+ return outputs, end_time - start_time
19
+
20
+ def decode_output(self, outputs):
21
+ return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
22
+
23
+ def compare_outputs(self, original_outputs, quantized_outputs):
24
+ if original_outputs is None or quantized_outputs is None:
25
+ return None
26
+
27
+ original_tokens = original_outputs[0].cpu().numpy()
28
+ quantized_tokens = quantized_outputs[0].cpu().numpy()
29
+
30
+ metrics = {
31
+ 'sequence_similarity': np.mean(original_tokens == quantized_tokens),
32
+ 'sequence_length_diff': abs(len(original_tokens) - len(quantized_tokens)),
33
+ }
34
+ return metrics
src/handlers/nlp_models/sequence_classification_handler.py ADDED
@@ -0,0 +1,49 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+ import numpy as np
6
+
7
+ class SequenceClassificationHandler(ModelHandler):
8
+ def __init__(self, model_name, model_class, quantization_type, test_text):
9
+ super().__init__(model_name, model_class, quantization_type, test_text)
10
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
11
+
12
+ def run_inference(self, model, text):
13
+ inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(self.device)
14
+ start_time = time.time()
15
+ with torch.no_grad():
16
+ outputs = model(**inputs)
17
+ end_time = time.time()
18
+ return outputs, end_time - start_time
19
+
20
+ def decode_output(self, outputs):
21
+ probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
22
+ predicted_class = torch.argmax(probabilities, dim=-1).item()
23
+ return f"Predicted class: {predicted_class}"
24
+
25
+ def compare_outputs(self, original_outputs, quantized_outputs):
26
+ """Compare outputs for sequence classification models"""
27
+ if original_outputs is None or quantized_outputs is None:
28
+ return None
29
+
30
+ orig_logits = original_outputs.logits.cpu().numpy()
31
+ quant_logits = quantized_outputs.logits.cpu().numpy()
32
+
33
+ orig_probs = torch.nn.functional.softmax(torch.tensor(orig_logits), dim=-1).numpy()
34
+ quant_probs = torch.nn.functional.softmax(torch.tensor(quant_logits), dim=-1).numpy()
35
+
36
+ orig_pred = orig_probs.argmax(axis=-1)
37
+ quant_pred = quant_probs.argmax(axis=-1)
38
+
39
+ metrics = {
40
+ 'class_match': float(orig_pred == quant_pred),
41
+ 'logits_mse': ((orig_logits - quant_logits) ** 2).mean(),
42
+ 'probability_mse': ((orig_probs - quant_probs) ** 2).mean(),
43
+ 'max_probability_diff': abs(orig_probs.max() - quant_probs.max()),
44
+ 'kl_divergence': float(
45
+ (orig_probs * (np.log(orig_probs + 1e-10) - np.log(quant_probs + 1e-10))).sum()
46
+ )
47
+ }
48
+
49
+ return metrics
src/handlers/nlp_models/token_classification_handler.py ADDED
@@ -0,0 +1,57 @@
1
+ from ..base_handler import ModelHandler
2
+ from transformers import AutoTokenizer
3
+ import torch
4
+ import time
5
+
6
+ class TokenClassificationHandler(ModelHandler):
7
+ def __init__(self, model_name, model_class, quantization_type, test_text):
8
+ super().__init__(model_name, model_class, quantization_type, test_text)
9
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name)
10
+
11
+ def run_inference(self, model, text):
12
+ inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(self.device)
13
+ start_time = time.time()
14
+ with torch.no_grad():
15
+ outputs = model(**inputs)
16
+ end_time = time.time()
17
+ return outputs, end_time - start_time
18
+
19
+ def decode_output(self, model, outputs):
20
+ tokens = self.tokenizer.convert_ids_to_tokens(outputs['input_ids'][0])
21
+ labels = torch.argmax(outputs.logits, dim=-1).squeeze().tolist()
22
+ decoded_labels = [model.config.id2label[label] for label in labels]
23
+ return dict(zip(tokens, decoded_labels))
24
+
25
+ def compare_outputs(self, original_outputs, quantized_outputs):
26
+ """Compare outputs for token classification models"""
27
+ if original_outputs is None or quantized_outputs is None:
28
+ return None
29
+
30
+ orig_logits = original_outputs.logits.cpu().numpy()
31
+ quant_logits = quantized_outputs.logits.cpu().numpy()
32
+
33
+ orig_preds = orig_logits.argmax(axis=-1)
34
+ quant_preds = quant_logits.argmax(axis=-1)
35
+
36
+ input_tokens = self.tokenizer.convert_ids_to_tokens(
37
+ self.tokenizer(self.test_text, return_tensors='pt')['input_ids'][0]
38
+ )
39
+
40
+ orig_labels = [self.original_model.config.id2label[p] for p in orig_preds[0]]
41
+ quant_labels = [self.quantized_model.config.id2label[p] for p in quant_preds[0]]
42
+
43
+ original_results = list(zip(input_tokens, orig_labels))
44
+ quantized_results = list(zip(input_tokens, quant_labels))
45
+
46
+ token_matches = sum(o_label == q_label for o_label, q_label in zip(orig_labels, quant_labels))
47
+ total_tokens = len(orig_labels)
48
+
49
+ metrics = {
50
+ 'original_predictions': original_results,
51
+ 'quantized_predictions': quantized_results,
52
+ 'token_level_accuracy': float(token_matches) / total_tokens if total_tokens > 0 else 0.0,
53
+ 'sequence_exact_match': float((orig_preds == quant_preds).all()),
54
+ 'logits_mse': ((orig_logits - quant_logits) ** 2).mean(),
55
+ }
56
+
57
+ return metrics
src/optimizations/onnx_conversion.py ADDED
@@ -0,0 +1,64 @@
1
+ import os
2
+ from transformers import AutoTokenizer, WhisperProcessor, AutoFeatureExtractor
3
+ from optimum.onnxruntime import (
4
+ ORTModelForQuestionAnswering,
5
+ ORTModelForCausalLM,
6
+ ORTModelForSequenceClassification,
7
+ ORTModelForTokenClassification,
8
+ ORTModelForSpeechSeq2Seq,
9
+ ORTModelForFeatureExtraction,
10
+ ORTModelForMaskedLM,
11
+ ORTModelForSeq2SeqLM,
12
+ ORTModelForMultipleChoice,
13
+ ORTModelForImageClassification,
14
+ )
15
+ import logging
16
+
17
+ logging.basicConfig(level=logging.INFO)
18
+ logger = logging.getLogger(__name__)
19
+
20
+ TASK_MAPPING = {
21
+ # NLP models
22
+ "ner": (ORTModelForTokenClassification, AutoTokenizer),
23
+ "text_classification": (ORTModelForSequenceClassification, AutoTokenizer),
24
+ "question_answering": (ORTModelForQuestionAnswering, AutoTokenizer),
25
+ "causal_lm": (ORTModelForCausalLM, AutoTokenizer),
26
+ "mask_lm": (ORTModelForMaskedLM, AutoTokenizer),
27
+ "seq2seq_lm": (ORTModelForSeq2SeqLM, AutoTokenizer),
28
+ "multiple_choice": (ORTModelForMultipleChoice, AutoTokenizer),
29
+ # Audio models
30
+ "whisper_finetuning": (ORTModelForSpeechSeq2Seq, WhisperProcessor),
31
+ # Vision models
32
+ "image_classification": (ORTModelForImageClassification, AutoFeatureExtractor),
33
+ }
34
+
35
+ def convert_to_onnx(model_name, task, output_dir):
36
+ """
37
+ Convert model to ONNX format for the specified task.
38
+ """
39
+ logger.info(f"Converting model: {model_name} for task: {task}")
40
+
41
+ os.makedirs(output_dir, exist_ok=True)
42
+
43
+ if task not in TASK_MAPPING:
44
+ logger.error(f"Task {task} is not supported for ONNX conversion in this script.")
45
+ return None
46
+
47
+ ORTModelClass, ProcessorClass = TASK_MAPPING[task]
48
+
49
+ try:
50
+ # every task in TASK_MAPPING (including "embedding") is exported through its ORTModel class
51
+ ort_model = ORTModelClass.from_pretrained(model_name, export=True)
52
+ ort_model.save_pretrained(output_dir)
56
+
57
+ processor = ProcessorClass.from_pretrained(model_name)
58
+ processor.save_pretrained(output_dir)
59
+
60
+ logger.info(f"Conversion complete. Model saved to: {output_dir}")
61
+ return output_dir
62
+ except Exception as e:
63
+ logger.error(f"Conversion failed: {str(e)}")
64
+ return None
src/optimizations/quantize.py ADDED
@@ -0,0 +1,109 @@
1
+ import os
2
+ import torch
3
+ import onnx
4
+ import logging
5
+ from scipy.stats import spearmanr
6
+ from sklearn.metrics.pairwise import cosine_similarity
7
+ from transformers import BitsAndBytesConfig
8
+ from onnxconverter_common import float16
9
+ from onnxruntime.quantization import quantize_dynamic, QuantType
10
+
11
+ logging.basicConfig(level=logging.INFO)
12
+ logger = logging.getLogger(__name__)
13
+
14
+ class ModelQuantizer:
15
+ """Handles model quantization and comparison operations"""
16
+
17
+ @staticmethod
18
+ def quantize_model(model_class, model_name, quantization_type):
19
+ """Quantizes a model based on specified quantization type"""
20
+ try:
21
+ if quantization_type == "4-bit":
22
+ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
23
+ model = model_class.from_pretrained(model_name, quantization_config=quantization_config)
24
+ elif quantization_type == "8-bit":
25
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
26
+ model = model_class.from_pretrained(model_name, quantization_config=quantization_config)
27
+ elif quantization_type == "16-bit-float":
28
+ model = model_class.from_pretrained(model_name)
29
+ model = model.to(torch.float16)
30
+ else:
31
+ raise ValueError(f"Unsupported quantization type: {quantization_type}")
32
+
33
+ return model
34
+ except Exception as e:
35
+ logger.error(f"Quantization failed: {str(e)}")
36
+ raise
37
+
38
+ @staticmethod
39
+ def get_model_size(model):
40
+ """Calculate model size in MB"""
41
+ try:
42
+ torch.save(model.state_dict(), "temp.pth")
43
+ size = os.path.getsize("temp.pth") / (1024 * 1024)
44
+ os.remove("temp.pth")
45
+ return size
46
+ except Exception as e:
47
+ logger.error(f"Failed to get model size: {str(e)}")
48
+ raise
49
+
50
+ @staticmethod
51
+ def compare_model_outputs(original_outputs, quantized_outputs):
52
+ """Compare outputs between original and quantized models"""
53
+ try:
54
+ if original_outputs is None or quantized_outputs is None:
55
+ return None
56
+
57
+ if hasattr(original_outputs, 'logits') and hasattr(quantized_outputs, 'logits'):
58
+ original_logits = original_outputs.logits.detach().cpu().numpy()
59
+ quantized_logits = quantized_outputs.logits.detach().cpu().numpy()
60
+
61
+ metrics = {
62
+ 'mse': ((original_logits - quantized_logits) ** 2).mean(),
63
+ 'spearman_corr': spearmanr(original_logits.flatten(), quantized_logits.flatten())[0],
64
+ 'cosine_sim': cosine_similarity(original_logits.reshape(1, -1), quantized_logits.reshape(1, -1))[0][0]
65
+ }
66
+ return metrics
67
+ return None
68
+ except Exception as e:
69
+ logger.error(f"Output comparison failed: {str(e)}")
70
+ raise
71
+
72
+ def quantize_onnx_model(model_dir, quantization_type):
73
+ """
74
+ Quantize ONNX model in the specified directory.
75
+ """
76
+ logger.info(f"Quantizing ONNX model in: {model_dir}")
77
+ for filename in os.listdir(model_dir):
78
+ if filename.endswith('.onnx'):
79
+ input_model_path = os.path.join(model_dir, filename)
80
+ output_model_path = os.path.join(model_dir, f"quantized_{filename}")
81
+
82
+ try:
83
+ model = onnx.load(input_model_path)
84
+
85
+ if quantization_type == "16-bit-float":
86
+ model_fp16 = float16.convert_float_to_float16(model)
87
+ onnx.save(model_fp16, output_model_path)
88
+ elif quantization_type in ["8-bit", "16-bit-int"]:
89
+ quant_type_mapping = {
90
+ "8-bit": QuantType.QInt8,
91
+ "16-bit-int": QuantType.QInt16,
92
+ }
93
+ quantize_dynamic(
94
+ model_input=input_model_path,
95
+ model_output=output_model_path,
96
+ weight_type=quant_type_mapping[quantization_type]
97
+ )
98
+ else:
99
+ logger.error(f"Unsupported quantization type: {quantization_type}")
100
+ continue
101
+
102
+ os.remove(input_model_path)
103
+ os.rename(output_model_path, input_model_path)
104
+
105
+ logger.info(f"Quantized ONNX model saved to: {input_model_path}")
106
+ except Exception as e:
107
+ logger.error(f"Error during ONNX quantization: {str(e)}")
108
+ if os.path.exists(output_model_path):
109
+ os.remove(output_model_path)
src/utilities/push_to_hub.py ADDED
@@ -0,0 +1,85 @@
1
+ import os
2
+ import logging
3
+ from pathlib import Path
4
+ from typing import Optional, Dict, Tuple
5
+ from huggingface_hub import HfApi, create_repo
6
+
7
+ logging.basicConfig(level=logging.INFO)
8
+ logger = logging.getLogger(__name__)
9
+
10
+ def push_to_hub(local_path: str, repo_name: str, hf_token: str, commit_message: Optional[str] = None, tags: Optional[list] = None) -> Tuple[Optional[str], str]:
11
+ """
12
+ Pushes a folder containing model files to the HuggingFace Hub.
13
+
14
+ Args:
15
+ local_path (str): Local directory containing the model files to upload.
16
+ repo_name (str): The repository name (not the full username/repo_name).
17
+ hf_token (str): HuggingFace authentication token.
18
+ commit_message (str, optional): Commit message for the upload.
19
+ tags (list, optional): Tags to include in the model card.
20
+
21
+ Returns:
22
+ Tuple[Optional[str], str]: (repository_name, status_message)
23
+ """
24
+ try:
25
+ api = HfApi(token=hf_token)
26
+
27
+ # Validate token
28
+ try:
29
+ user_info = api.whoami()
30
+ username = user_info["name"]
31
+ except Exception as e:
32
+ return None, f"❌ Authentication failed: Invalid token or network error ({str(e)})"
33
+
34
+ # Full repository name with the username
35
+ full_repo_name = f"{username}/{repo_name}"
36
+
37
+ # Create the repo
38
+ try:
39
+ create_repo(full_repo_name, token=hf_token, exist_ok=True)
40
+ logger.info(f"Repository created/verified: {full_repo_name}")
41
+ except Exception as e:
42
+ return None, f"❌ Repository creation failed: {str(e)}"
43
+
44
+ # Create model card
45
+ try:
46
+ tags_list = tags or []
47
+ tags_section = "\n".join(f"- {tag}" for tag in tags_list)
48
+ model_card = f"""---
49
+ tags:
50
+ {tags_section}
51
+ library_name: optimum
52
+ ---
53
+
54
+ # Model - {repo_name}
55
+
56
+ This model has been optimized and uploaded to the HuggingFace Hub.
57
+
58
+ ## Model Details
59
+ - Optimization Tags: {', '.join(tags_list)}
60
+ """
61
+ with open(os.path.join(local_path, "README.md"), "w") as f:
62
+ f.write(model_card)
63
+ except Exception as e:
64
+ logger.warning(f"Model card creation warning: {str(e)}")
65
+
66
+ # Upload the folder
67
+ try:
68
+ api.upload_folder(
69
+ folder_path=local_path,
70
+ repo_id=full_repo_name,
71
+ repo_type="model",
72
+ commit_message=commit_message or "Upload optimized model"
73
+ )
74
+ success_msg = f"✅ Model successfully pushed to: {full_repo_name}"
75
+ logger.info(success_msg)
76
+ return full_repo_name, success_msg
77
+ except Exception as e:
78
+ error_msg = f"❌ Upload failed: {str(e)}"
79
+ logger.error(error_msg)
80
+ return None, error_msg
81
+
82
+ except Exception as e:
83
+ error_msg = f"❌ Unexpected error during push: {str(e)}"
84
+ logger.error(error_msg)
85
+ return None, error_msg
src/utilities/resources.py ADDED
@@ -0,0 +1,54 @@
1
+ import shutil
2
+ import psutil
3
+ import torch
4
+ import logging
5
+ from pathlib import Path
6
+ from typing import Optional, Dict
7
+
8
+ logging.basicConfig(level=logging.INFO)
9
+ logger = logging.getLogger(__name__)
10
+
11
+ class ResourceManager:
12
+ def __init__(self, temp_dir: str = "temp"):
13
+ self.temp_dir = Path(temp_dir)
14
+ self.temp_dirs = {
15
+ "onnx": self.temp_dir / "onnx_output",
16
+ "quantized": self.temp_dir / "quantized_models",
17
+ "cache": self.temp_dir / "model_cache"
18
+ }
19
+ self.setup_directories()
20
+
21
+ def setup_directories(self):
22
+ for dir_path in self.temp_dirs.values():
23
+ dir_path.mkdir(parents=True, exist_ok=True)
24
+
25
+ def cleanup_temp_files(self, specific_dir: Optional[str] = None) -> str:
26
+ try:
27
+ if specific_dir:
28
+ if specific_dir in self.temp_dirs:
29
+ shutil.rmtree(self.temp_dirs[specific_dir], ignore_errors=True)
30
+ self.temp_dirs[specific_dir].mkdir(exist_ok=True)
31
+ else:
32
+ shutil.rmtree(self.temp_dir, ignore_errors=True)
33
+ self.setup_directories()
34
+ return "✨ Cleanup successful!"
35
+ except Exception as e:
36
+ logger.error(f"Cleanup failed: {str(e)}")
37
+ return f"❌ Cleanup failed: {str(e)}"
38
+
39
+ def get_memory_info(self) -> Dict[str, float]:
40
+ vm = psutil.virtual_memory()
41
+ memory_info = {
42
+ "total_ram": vm.total / (1024 ** 3),
43
+ "available_ram": vm.available / (1024 ** 3),
44
+ "used_ram": vm.used / (1024 ** 3)
45
+ }
46
+
47
+ if torch.cuda.is_available():
48
+ device = torch.cuda.current_device()
49
+ memory_info.update({
50
+ "gpu_total": torch.cuda.get_device_properties(device).total_memory / (1024 ** 3),
51
+ "gpu_used": torch.cuda.memory_allocated(device) / (1024 ** 3)
52
+ })
53
+
54
+ return memory_info