---
license: mit
language: en
tags:
- multiple-choice
- quantization
- W8A8
- LLMCompressor
- bf16
- int8
model_type: causal-lm
base_model: hssawhney/mnlp-model
pipeline_tag: text-generation
---
# Quantized MCQA Model – W8A8

## Model Summary
This model is a quantized version of our MCQA model. It was produced with post-training quantization (PTQ) targeting both weights and activations (W8A8), using the LLMCompressor framework.
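The checkpoint loads through the standard `transformers` API. The sketch below is illustrative only: the repo id is a hypothetical placeholder for this model's actual path, and deserializing an LLMCompressor checkpoint requires the `compressed-tensors` package to be installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; substitute the actual path of this quantized model.
model_id = "hssawhney/mnlp-model-w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Quick smoke test: greedy-decode a short answer.
inputs = tokenizer("Q: What is 2 + 2?\nA:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```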
## Technical Details
- Base model: `hssawhney/mnlp-model`
- Quantization method: SmoothQuant + GPTQ
- Precision: INT8 weights and INT8 activations (W8A8); layers excluded from quantization remain in BF16
- Calibration data: 512 samples from `zay25/quantization-dataset`
- Excluded layers: `lm_head` (to preserve output logits)
- Final model size: ~717 MB
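For reference, a one-shot run with the settings above would look roughly like the following LLMCompressor sketch. This is not the exact script used to produce the checkpoint; the smoothing strength, sequence length, and output directory are assumptions.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# 512 calibration samples, as listed above.
calibration = load_dataset("zay25/quantization-dataset", split="train[:512]")

recipe = [
    # Migrate activation outliers into the weights before quantizing;
    # the 0.8 smoothing strength is an assumed (commonly used) value.
    SmoothQuantModifier(smoothing_strength=0.8),
    # INT8 weights and activations on all Linear layers, skipping lm_head.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="hssawhney/mnlp-model",
    dataset=calibration,
    recipe=recipe,
    max_seq_length=2048,           # assumption
    num_calibration_samples=512,
    output_dir="mnlp-model-W8A8",  # assumption
)
```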
## Evaluation
The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy dropped by only 0.02 compared to the full-precision (FP32) version.
## Intended Use
This model is optimized for efficient inference in multiple-choice question answering tasks, particularly in the context of STEM tutoring. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
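One common way to run a causal LM on multiple-choice questions is to score each candidate answer by its total log-likelihood under the model and pick the highest-scoring choice. The sketch below illustrates this approach; the repo id is again a hypothetical placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hssawhney/mnlp-model-w8a8"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

@torch.no_grad()
def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of the model's log-probabilities over the tokens of `choice`,
    conditioned on `prompt`. Assumes the prompt tokenization is a prefix
    of the full tokenization (true for typical BPE tokenizers when the
    choice starts with a space)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, prompt_len:]
    return logprobs[prompt_len - 1 :].gather(-1, targets.unsqueeze(-1)).sum().item()

question = "Q: Which planet is closest to the Sun?\nA:"
choices = [" Mercury", " Venus", " Earth", " Mars"]
best = max(choices, key=lambda c: choice_logprob(question, c))
print(best.strip())
```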