---
license: mit
language: en
tags:
- multiple-choice
- quantization
- W8A8
- LLMCompressor
- bf16
- int8
model_type: causal-lm
base_model: hssawhney/mnlp-model
pipeline_tag: text-generation
---
|
|
|
# Quantized MCQA Model – W8A8 |
|
|
|
## Model Summary |
|
This model is a quantized version of our MCQA model, produced with post-training quantization (PTQ) targeting both weights and activations (W8A8) via the [LLMCompressor](https://github.com/vllm-project/llm-compressor) framework.
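For intuition, the toy sketch below illustrates symmetric per-tensor INT8 quantization of a weight matrix in plain NumPy. It is purely illustrative and is not the actual LLMCompressor pipeline, which uses SmoothQuant and GPTQ as described under Technical Details.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric per-tensor quantization: map the largest |value| to 127,
    # then round and clip everything into the signed 8-bit range.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float tensor.
    return q.astype(np.float32) * scale

w = np.random.randn(16, 16).astype(np.float32)  # stand-in for a weight tensor
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max reconstruction error: {err:.4f}")
```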
|
|
|
## Technical Details |
|
- **Base model:** [`hssawhney/mnlp-model`](https://huggingface.co/hssawhney/mnlp-model) |
|
- **Quantization method:** SmoothQuant + GPTQ (see the recipe sketch after this list)
|
- **Precision:** INT8 weights and INT8 activations (W8A8); non-quantized tensors remain in BF16
|
- **Calibration data:** 512 samples from [`zay25/quantization-dataset`](https://huggingface.co/datasets/zay25/quantization-dataset) |
|
- **Excluded layers:** `lm_head` (to preserve output logits) |
|
- **Final model size:** ~717 MB |
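The settings above roughly correspond to a one-shot LLMCompressor recipe like the sketch below. This is an illustrative reconstruction rather than the exact script: the calibration split, text column, sequence length, smoothing strength, and save path are all assumptions.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "hssawhney/mnlp-model"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration samples; the split name and "text" column are assumptions.
ds = load_dataset("zay25/quantization-dataset", split="train")
ds = ds.shuffle(seed=42).select(range(512))

def tokenize(sample):
    return tokenizer(sample["text"], max_length=2048, truncation=True)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# SmoothQuant migrates activation outliers into the weights so that GPTQ
# can quantize both weights and activations to INT8 (W8A8), skipping lm_head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # strength is an assumption
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,            # assumption
    num_calibration_samples=512,
)

# Hypothetical output path for the compressed checkpoint.
model.save_pretrained("mnlp-model-W8A8", save_compressed=True)
tokenizer.save_pretrained("mnlp-model-W8A8")
```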
|
|
|
## Evaluation |
|
The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy decreased by only **0.02** compared to the original full-precision model.
|
|
|
## Intended Use |
|
This model is optimized for **efficient inference** in **multiple-choice question answering** tasks, particularly in the context of **STEM tutoring**. It is well-suited for low-resource deployment environments where latency and memory usage are critical. |
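Because LLMCompressor produces compressed-tensors checkpoints that vLLM can load directly, a minimal inference sketch might look like this (the repo id below is a placeholder for the published quantized checkpoint):

```python
from vllm import LLM, SamplingParams

# Placeholder repo id for the quantized checkpoint.
llm = LLM(model="hssawhney/mnlp-model-W8A8")

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\nAnswer:"
)

# Greedy decoding; a single token suffices for the answer letter.
params = SamplingParams(temperature=0.0, max_tokens=1)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```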
|
|