---
license: mit
language: en
tags:
- multiple-choice
- quantization
- W8A8
- LLMCompressor
- bf16
- int8
model_type: causal-lm
base_model: hssawhney/mnlp-model
pipeline_tag: text-generation
---
|
|
|
# Quantized MCQA Model – W8A8 |
|
|
|
## Model Summary |
|
This model is a quantized version of our MCQA model, produced with post-training quantization (PTQ) targeting both weights and activations (W8A8) via the [LLMCompressor](https://github.com/vllm-project/llm-compressor) framework.
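For intuition, the toy sketch below illustrates symmetric per-tensor INT8 quantization of a weight matrix in plain NumPy. It is purely illustrative and is not the actual LLMCompressor pipeline, which uses SmoothQuant and GPTQ as described under Technical Details.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric per-tensor quantization: map the largest |value| to 127,
    # then round and clip everything into the signed 8-bit range.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float tensor.
    return q.astype(np.float32) * scale

w = np.random.randn(16, 16).astype(np.float32)  # stand-in for a weight tensor
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max reconstruction error: {err:.4f}")
```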
|
|
|
## Technical Details |
|
- **Base model:** [`hssawhney/mnlp-model`](https://huggingface.co/hssawhney/mnlp-model) |
|
- **Quantization method:** SmoothQuant + GPTQ (see the recipe sketch after this list)
|
- **Precision:** INT8 weights and INT8 activations (W8A8); non-quantized tensors remain in BF16
|
- **Calibration data:** 512 samples from [`zay25/quantization-dataset`](https://huggingface.co/datasets/zay25/quantization-dataset) |
|
- **Excluded layers:** `lm_head` (to preserve output logits) |
|
- **Final model size:** ~717 MB |
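The settings above roughly correspond to a one-shot LLMCompressor recipe like the sketch below. This is an illustrative reconstruction rather than the exact script: the calibration split, text column, sequence length, smoothing strength, and save path are all assumptions.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "hssawhney/mnlp-model"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration samples; the split name and "text" column are assumptions.
ds = load_dataset("zay25/quantization-dataset", split="train")
ds = ds.shuffle(seed=42).select(range(512))

def tokenize(sample):
    return tokenizer(sample["text"], max_length=2048, truncation=True)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# SmoothQuant migrates activation outliers into the weights so that GPTQ
# can quantize both weights and activations to INT8 (W8A8), skipping lm_head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # strength is an assumption
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,            # assumption
    num_calibration_samples=512,
)

# Hypothetical output path for the compressed checkpoint.
model.save_pretrained("mnlp-model-W8A8", save_compressed=True)
tokenizer.save_pretrained("mnlp-model-W8A8")
```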
|
|
|
## Evaluation |
|
The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy decreased by only **0.02** compared to the original full-precision model.
|
|
|
## Intended Use |
|
This model is optimized for **efficient inference** in **multiple-choice question answering** tasks, particularly in the context of **STEM tutoring**. It is well-suited for low-resource deployment environments where latency and memory usage are critical. |
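Because LLMCompressor produces compressed-tensors checkpoints that vLLM can load directly, a minimal inference sketch might look like this (the repo id below is a placeholder for the published quantized checkpoint):

```python
from vllm import LLM, SamplingParams

# Placeholder repo id for the quantized checkpoint.
llm = LLM(model="hssawhney/mnlp-model-W8A8")

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\nAnswer:"
)

# Greedy decoding; a single token suffices for the answer letter.
params = SamplingParams(temperature=0.0, max_tokens=1)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```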
|
|