---
license: mit
language: en
tags:
- multiple-choice
- quantization
- W8A8
- LLMCompressor
- bf16
- int8
model_type: causal-lm
base_model: hssawhney/mnlp-model
pipeline_tag: text-generation
---
# Quantized MCQA Model – W8A8
## Model Summary
This model is a post-training-quantized (PTQ) version of our MCQA model, with both weights and activations quantized to 8 bits (W8A8) using the [LLMCompressor](https://github.com/vllm-project/llm-compressor) framework. A reproduction sketch is included below the technical details.
## Technical Details
- **Base model:** [`hssawhney/mnlp-model`](https://huggingface.co/hssawhney/mnlp-model)
- **Quantization method:** SmoothQuant + GPTQ
- **Precision:** INT8 weights and INT8 activations (W8A8); non-quantized modules remain in BF16
- **Calibration data:** 512 samples from [`zay25/quantization-dataset`](https://huggingface.co/datasets/zay25/quantization-dataset)
- **Excluded layers:** `lm_head` (to preserve output logits)
- **Final model size:** ~717 MB
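
The sketch below shows how a SmoothQuant + GPTQ W8A8 recipe of this kind is typically applied with LLMCompressor. It is a hedged reconstruction, not the exact script used for this checkpoint: import paths follow the llm-compressor INT8 W8A8 examples and may differ across versions, and the preprocessing of `zay25/quantization-dataset` (column names, sequence length, smoothing strength) is an assumption.

```python
# Hedged reproduction sketch of the W8A8 PTQ recipe described above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot  # newer versions: from llmcompressor import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier

BASE = "hssawhney/mnlp-model"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048  # assumption: not stated in this card

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Calibration data: 512 samples from the quantization dataset.
ds = load_dataset("zay25/quantization-dataset", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def preprocess(example):
    # Assumption: the dataset exposes a plain "text" field usable as-is.
    return tokenizer(
        example["text"],
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

# SmoothQuant shifts activation outliers into the weights, then GPTQ
# quantizes linear layers to INT8 weights/activations; lm_head is skipped
# to preserve the output logits.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("mnlp-model-W8A8", save_compressed=True)
tokenizer.save_pretrained("mnlp-model-W8A8")
```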
## Evaluation
The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy decreased by only **0.02** compared to the full-precision (BF16) base model.
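
For intuition, MCQA accuracy of this kind is typically scored by ranking the answer choices by their log-likelihood under the model. The snippet below is a simplified illustration of that scoring idea, not the LightEval harness itself; the prompt/choice formatting and the local model path are assumptions.

```python
# Illustrative MCQA scoring: pick the choice whose continuation gets the
# highest summed log-probability given the question prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mnlp-model-W8A8"  # placeholder: local path or Hub id of the quantized model
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

@torch.no_grad()
def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0, :-1]   # logits predicting tokens 1..T-1
    targets = full_ids[0, 1:]                 # shifted targets
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    # Positions that predict the choice tokens (assumes the prompt tokenizes
    # to the same prefix inside prompt + choice).
    choice_slice = slice(prompt_ids.shape[1] - 1, None)
    return logprobs[choice_slice].gather(1, targets[choice_slice, None]).sum().item()

question = "Question: What is 2 + 2?\nAnswer:"
choices = [" 3", " 4", " 5"]
pred = max(range(len(choices)), key=lambda i: choice_logprob(question, choices[i]))
print(choices[pred])  # expected: " 4"
```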
## Intended Use
This model is optimized for **efficient inference** in **multiple-choice question answering** tasks, particularly in the context of **STEM tutoring**. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
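
Because the checkpoint is stored in LLMCompressor's compressed-tensors W8A8 format, it can be served directly with vLLM. A minimal sketch follows; the model id placeholder and sampling settings are illustrative, not prescribed by this card.

```python
# Minimal inference sketch with vLLM, which loads compressed-tensors W8A8
# checkpoints produced by LLMCompressor.
from vllm import LLM, SamplingParams

MODEL_ID = "path-or-hub-id-of-this-quantized-model"  # placeholder

llm = LLM(model=MODEL_ID)
params = SamplingParams(temperature=0.0, max_tokens=8)  # greedy decoding, short answer

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```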