Quantizer class

```python
keras.quantizers.Quantizer(output_dtype="int8")
```
AbsMaxQuantizer class

```python
keras.quantizers.AbsMaxQuantizer(
    axis=None, value_range=(-127, 127), epsilon=1e-07, output_dtype="int8"
)
```
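Conceptually, abs-max quantization scales values so that the largest absolute value along the reduction axes lands on the edge of `value_range`, then rounds to the output dtype. The following is a minimal NumPy sketch of that scheme (illustrative only, not the Keras implementation; the class's exact return values and dtype handling may differ):

```python
import numpy as np

def abs_max_quantize(x, axis=None, value_range=(-127, 127), epsilon=1e-7):
    # Scale so that max(|x|) along `axis` maps onto the edge of value_range.
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = value_range[1] / (amax + epsilon)  # epsilon guards against /0
    q = np.clip(np.round(x * scale), *value_range).astype(np.int8)
    return q, scale  # keep the scale to dequantize: x ≈ q / scale
```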
QuantizationConfig class

```python
keras.quantizers.QuantizationConfig(weight_quantizer=None, activation_quantizer=None)
```
Base class for quantization configs.
Subclasses must implement the mode property and the get_config and
from_config class methods.
Arguments

- weight_quantizer: Optional quantizer applied to the layer's weights.
- activation_quantizer: Optional quantizer applied to the layer's activations.
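For example, a subclass might look like the following sketch (a hypothetical subclass; the `mode` string and the serialization contract shown are assumptions based on the description above):

```python
from keras.quantizers import AbsMaxQuantizer, QuantizationConfig

class AbsMaxInt8Config(QuantizationConfig):
    """Hypothetical config: int8 weights via per-channel abs-max scaling."""

    def __init__(self):
        super().__init__(weight_quantizer=AbsMaxQuantizer(axis=-1))

    @property
    def mode(self):
        return "int8"  # assumed mode identifier

    def get_config(self):
        return {}  # serialize constructor arguments here

    @classmethod
    def from_config(cls, config):
        return cls(**config)
```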
Int8QuantizationConfig class

```python
keras.quantizers.Int8QuantizationConfig(
    weight_quantizer=None, activation_quantizer="default"
)
```
Int8 quantization config.
Arguments

- weight_quantizer: Optional quantizer for the layer's weights.
- activation_quantizer: Quantizer for the layer's activations. Defaults to "default".
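As an illustration, a config that overrides the weight quantizer could be built as below (a sketch; whether `Model.quantize` accepts this config for "int8" mode the same way the GPTQ example later passes its config is an assumption):

```python
from keras.quantizers import AbsMaxQuantizer, Int8QuantizationConfig

# Per-output-channel abs-max quantization for weights; activations keep
# the default int8 quantizer.
config = Int8QuantizationConfig(
    weight_quantizer=AbsMaxQuantizer(axis=-1),
)
# Assumed to be passed the same way as the GPTQ config shown below:
# model.quantize("int8", config=config)
```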
GPTQConfig class

```python
keras.quantizers.GPTQConfig(
    dataset,
    tokenizer,
    weight_bits: int = 4,
    num_samples: int = 128,
    per_channel: bool = True,
    sequence_length: int = 512,
    hessian_damping: float = 0.01,
    group_size: int = 128,
    symmetric: bool = False,
    activation_order: bool = False,
    quantization_layer_structure: dict = None,
)
```
Configuration class for the GPTQ algorithm ("GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers").

GPTQ is a post-training quantization method that quantizes neural network weights to lower precision (e.g., 4-bit) while minimizing the impact on model accuracy. It works by analyzing the Hessian of the layer-wise reconstruction error with respect to the weights and applying an optimal quantization scheme that preserves the most important weight values.
When to use GPTQ:

- You want to reduce model size and memory usage.
- You need faster inference on hardware that supports low-precision operations.
- You want to maintain model accuracy as much as possible.
- You have a pre-trained model that you want to quantize without retraining.
How it works (see the sketch after this list):

1. Uses calibration data to compute the Hessian matrix for each layer.
2. Applies iterative quantization with error correction.
3. Reorders weights based on activation importance (optional).
4. Quantizes weights while minimizing quantization error.
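The core error-correction loop can be sketched in a few lines of NumPy. This is a toy illustration of the published algorithm (Frantar et al.), not the Keras implementation; grouping, per-channel handling, and activation reordering are omitted:

```python
import numpy as np

def quantize_col(w, bits=4):
    # Round-to-nearest on a symmetric abs-max grid (one column at a time).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-8
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_quantize(W, X, bits=4, damping=0.01):
    # W: (out_features, in_features) weights
    # X: (in_features, n_samples) calibration activations for this layer
    H = 2.0 * X @ X.T                                   # layer-wise Hessian
    H += damping * np.mean(np.diag(H)) * np.eye(H.shape[0])  # stabilization
    U = np.linalg.cholesky(np.linalg.inv(H)).T          # upper Cholesky of H^-1
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize_col(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / U[j, j]             # normalized error
        # Error correction: fold the error into the not-yet-quantized columns.
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```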
Example usage:
```python
from keras.quantizers import GPTQConfig
from keras import Model

# Create a configuration for 4-bit quantization
config = GPTQConfig(
    dataset=calibration_data,  # Your calibration dataset
    tokenizer=your_tokenizer,  # Tokenizer for text data
    weight_bits=4,             # Quantize to 4 bits
    num_samples=128,           # Number of calibration samples
    sequence_length=512,       # Sequence length for each sample
    hessian_damping=0.01,      # Hessian stabilization factor
    group_size=128,            # Weight grouping for quantization
    symmetric=False,           # Use asymmetric quantization
    activation_order=True,     # Reorder weights by importance
)

# Apply quantization to your model
model = Model(...)  # Your pre-trained model
model.quantize("gptq", config=config)

# The model now has quantized weights and can be used for inference
```
Benefits:

- Memory reduction: 4-bit quantization reduces weight memory by ~8x compared to float32 (32 bits / 4 bits = 8).
- Faster inference: lower-precision operations are faster on supported hardware.
- Accuracy preservation: minimizes accuracy loss through optimal quantization.
- No retraining required: works with pre-trained models.
Advanced usage examples:
Per-channel quantization (recommended for most cases):

```python
config = GPTQConfig(
    dataset=calibration_data,
    tokenizer=tokenizer,
    weight_bits=4,
    group_size=-1,   # -1 enables per-channel quantization
    symmetric=False,
)
```
Grouped quantization (for specific hardware requirements):

```python
config = GPTQConfig(
    dataset=calibration_data,
    tokenizer=tokenizer,
    weight_bits=4,
    group_size=64,   # 64 weights share the same scale factor
    symmetric=True,  # Use symmetric quantization
)
```
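To make the `group_size` semantics concrete, here is an illustrative NumPy fragment. The grouping layout is an assumption for exposition; actual kernels may group along a different axis:

```python
import numpy as np

def grouped_scales(w_row, group_size=64, qmax=7):
    # Split one weight row into groups of `group_size`; each group gets its
    # own abs-max scale, so quantization error stays localized per group.
    groups = w_row.reshape(-1, group_size)
    return np.abs(groups).max(axis=1) / qmax  # one scale per group

w_row = np.random.randn(512).astype("float32")
print(grouped_scales(w_row).shape)  # (8,) -> 512 weights / 64 per group
```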
High-accuracy quantization with activation ordering:

```python
config = GPTQConfig(
    dataset=calibration_data,
    tokenizer=tokenizer,
    weight_bits=4,
    activation_order=True,  # Reorder weights by importance
    hessian_damping=0.005,  # Lower damping for more precise quantization
    num_samples=256,        # More samples for better accuracy
)
```
References:

- Original GPTQ paper: "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- Implementation based on: https://github.com/IST-DASLab/gptq
- Suitable for: Transformer models, large language models, and other deep neural networks
Note: The quality of quantization depends heavily on the calibration dataset. Use representative data that covers the expected input distribution for best results.
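For instance, a calibration set for a text model could be as simple as a list of representative strings (an illustrative assumption about the accepted dataset format; `your_tokenizer` is a hypothetical placeholder):

```python
from keras.quantizers import GPTQConfig

# A handful of representative prompts drawn from the target domain.
calibration_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization reduces model memory at inference time.",
    # ... more samples covering the expected input distribution
]

config = GPTQConfig(
    dataset=calibration_data,
    tokenizer=your_tokenizer,  # hypothetical tokenizer instance
    num_samples=len(calibration_data),
)
```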
Arguments

- dataset: The calibration dataset used to compute per-layer statistics.
- tokenizer: A `keras_nlp.Tokenizer` instance (or a similar callable) that is used to process the dataset if it contains strings.
- weight_bits: Number of bits to quantize weights to. Defaults to 4.
- num_samples: Number of calibration samples to use. Defaults to 128.
- per_channel: If True, quantizes weights per channel. Defaults to True.
- sequence_length: Sequence length for each calibration sample. Defaults to 512.
- hessian_damping: Damping factor used to stabilize the Hessian. Defaults to 0.01.
- group_size: Number of weights that share a scale factor. A group_size of -1 indicates per-channel quantization. Defaults to 128.
- symmetric: If True, uses symmetric quantization. If False, uses asymmetric quantization. Defaults to False.
- activation_order: If True, reorders weight columns based on activation magnitude, which can improve quantization accuracy. Defaults to False.
- quantization_layer_structure: Optional dict describing the model's quantization layer structure; see get_quantization_layer_structure.
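To illustrate the symmetric flag, here is a minimal NumPy comparison of the two schemes (illustrative only; the actual kernels, zero-point rounding, and grid details differ):

```python
import numpy as np

def fake_quantize(w, bits=4, symmetric=False):
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    if symmetric:
        # Zero-point fixed at 0; the grid is centered on zero.
        scale = np.abs(w).max() / qmax
        zero = 0.0
    else:
        # Asymmetric: the grid spans [min, max] via a shifted zero-point.
        scale = (w.max() - w.min()) / (qmax - qmin)
        zero = qmin - w.min() / scale
    q = np.clip(np.round(w / scale + zero), qmin, qmax)
    return (q - zero) * scale    # dequantized values

w = np.array([-0.9, -0.1, 0.2, 1.5])
print(fake_quantize(w, symmetric=True))   # grid centered on zero
print(fake_quantize(w, symmetric=False))  # grid fitted to [min, max]
```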