What are efficient methods for post-training quantization to compress generative model sizes?

Question

Name the effective methods for post-training quantization to compress generative model size.

Ashutosh · Answer 1 · Nov 22, 2024

Efficient post-training quantization methods that reduce the size of generative models are as follows:

Dynamic Quantization:
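- Weights are quantized to lower precision (typically int8) ahead of time, while activations are quantized on the fly during inference.
- Needs no calibration data and is quick to apply; accuracy loss is usually small.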
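
As a rough illustration of the dynamic approach, here is a minimal pure-Python sketch. It assumes symmetric per-tensor int8 quantization, and the helper names (`quantize_symmetric`, `dynamic_quantized_dot`) are made up for this example rather than taken from any library. The weights are converted once, while the activation scale is derived from each input at inference time.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one scale (symmetric, per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dynamic_quantized_dot(q_weights, w_scale, activations):
    """Dot product with pre-quantized weights and activations quantized on the fly."""
    # The activation scale is computed from this particular input (the "dynamic" part).
    q_act, a_scale = quantize_symmetric(activations)
    acc = sum(qw * qa for qw, qa in zip(q_weights, q_act))  # integer accumulation
    return acc * w_scale * a_scale  # rescale the result back to float


# Hypothetical weights for illustration; quantized once, before inference.
weights = [0.5, -1.0, 0.25, 0.75]
q_weights, w_scale = quantize_symmetric(weights)

x = [1.0, 2.0, -0.5, 0.1]
approx = dynamic_quantized_dot(q_weights, w_scale, x)
exact = sum(w * a for w, a in zip(weights, x))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In practice a framework would do this per layer with optimized integer kernels; the sketch only shows where the scales come from.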
Static Quantization:
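- Requires a calibration dataset to fix the quantization ranges for activations before deployment.
- Often faster at inference than dynamic quantization, since no ranges are computed at runtime; works best when production inputs resemble the calibration data.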
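
The calibration step can be sketched as follows. This is a simplified example that assumes plain min/max calibration and affine (asymmetric) uint8 quantization; the function names and the sample batches are hypothetical. The key point is that the scale and zero point are computed once from calibration data and then stay fixed.

```python
def calibrate(batches):
    """Record the smallest and largest activation seen across calibration batches."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    return lo, hi


def affine_params(lo, hi, num_bits=8):
    """Turn an observed float range into a fixed scale and zero point."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Quantize with the fixed parameters; out-of-range values are clipped."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in x]


def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]


# Hypothetical activations collected while running a calibration set.
calibration_batches = [[-1.0, 0.2, 0.9], [0.0, 1.5, -0.4], [2.0, -0.8, 0.3]]
lo, hi = calibrate(calibration_batches)
scale, zero_point = affine_params(lo, hi)

# At inference the parameters are reused as-is; 3.0 lies outside the calibrated range.
new_input = [0.5, -1.0, 2.0, 3.0]
q_input = quantize(new_input, scale, zero_point)
restored = dequantize(q_input, scale, zero_point)
print(q_input, [round(v, 3) for v in restored])
```

The last input value shows the main risk of static quantization: anything outside the calibrated range is clipped, which is why the calibration data should be representative.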
Quantization-Aware Training (QAT):
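- Simulates quantization during training (or fine-tuning) so the model learns to tolerate it. Strictly speaking, this is a training-time method rather than post-training quantization.
- Best for preserving accuracy in low-bit models, but computationally expensive because it requires retraining.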
Weight Sharing:
- Groups weights into clusters and stores one shared value per cluster plus a small index per weight, reducing memory usage.
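
Weight sharing can be illustrated with a small 1-D k-means sketch. It assumes linearly spaced initial centroids and `k >= 2`; the function name `kmeans_1d` and the sample weights are invented for this example. Each weight is replaced by an index into a small codebook of shared values.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar weights into k shared values (assumes k >= 2)."""
    lo, hi = min(values), max(values)
    # Linearly spaced initial centroids between the smallest and largest weight.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # leave empty clusters where they are
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


# Hypothetical layer weights for illustration.
weights = [0.11, 0.09, -0.52, -0.48, 0.98, 1.02, 0.10, -0.50]
codebook, indices = kmeans_1d(weights, k=3)

# Only the codebook (k floats) and the small integer indices need to be stored.
reconstructed = [codebook[i] for i in indices]
max_error = max(abs(w - r) for w, r in zip(weights, reconstructed))
print(codebook, indices, max_error)
```

With 3 clusters each index fits in 2 bits, so the savings grow with the number of weights, since the codebook cost is fixed.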

Hence, by referring to the above methods, you can apply post-training quantization to compress generative model sizes.

answered Nov 22, 2024 by Ashutosh
• 25,810 points
