---
license: mit
language: en
tags:
- multiple-choice
- quantization
- W8A8
- LLMCompressor
- bf16
- int8
model_type: causal-lm
base_model: hssawhney/mnlp-model
pipeline_tag: text-generation
---
# Quantized MCQA Model – W8A8

## Model Summary
This model is a quantized version of our MCQA model. It was produced with post-training quantization (PTQ) targeting both weights and activations (W8A8), using the LLMCompressor framework.
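The checkpoint loads through the standard `transformers` API. The sketch below is illustrative only: the repo id is a hypothetical placeholder for this model's actual path, and deserializing an LLMCompressor checkpoint requires the `compressed-tensors` package to be installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; substitute the actual path of this quantized model.
model_id = "hssawhney/mnlp-model-w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Quick smoke test: greedy-decode a short answer.
inputs = tokenizer("Q: What is 2 + 2?\nA:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```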
## Technical Details
- Base model: `hssawhney/mnlp-model`
- Quantization method: SmoothQuant + GPTQ
- Precision: INT8 weights and INT8 activations (W8A8); layers excluded from quantization remain in BF16
- Calibration data: 512 samples from `zay25/quantization-dataset`
- Excluded layers: `lm_head` (to preserve output logits)
- Final model size: ~717 MB
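For reference, a one-shot run with the settings above would look roughly like the following LLMCompressor sketch. This is not the exact script used to produce the checkpoint; the smoothing strength, sequence length, and output directory are assumptions.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# 512 calibration samples, as listed above.
calibration = load_dataset("zay25/quantization-dataset", split="train[:512]")

recipe = [
    # Migrate activation outliers into the weights before quantizing;
    # the 0.8 smoothing strength is an assumed (commonly used) value.
    SmoothQuantModifier(smoothing_strength=0.8),
    # INT8 weights and activations on all Linear layers, skipping lm_head.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="hssawhney/mnlp-model",
    dataset=calibration,
    recipe=recipe,
    max_seq_length=2048,           # assumption
    num_calibration_samples=512,
    output_dir="mnlp-model-W8A8",  # assumption
)
```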
## Evaluation
The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy dropped by only 0.02 compared to the full-precision (FP32) version.
## Intended Use
This model is optimized for efficient inference in multiple-choice question answering tasks, particularly in the context of STEM tutoring. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
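One common way to run a causal LM on multiple-choice questions is to score each candidate answer by its total log-likelihood under the model and pick the highest-scoring choice. The sketch below illustrates this approach; the repo id is again a hypothetical placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hssawhney/mnlp-model-w8a8"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

@torch.no_grad()
def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of the model's log-probabilities over the tokens of `choice`,
    conditioned on `prompt`. Assumes the prompt tokenization is a prefix
    of the full tokenization (true for typical BPE tokenizers when the
    choice starts with a space)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, prompt_len:]
    return logprobs[prompt_len - 1 :].gather(-1, targets.unsqueeze(-1)).sum().item()

question = "Q: Which planet is closest to the Sun?\nA:"
choices = [" Mercury", " Venus", " Earth", " Mars"]
best = max(choices, key=lambda c: choice_logprob(question, c))
print(best.strip())
```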