
Customizing Quantization with QuantizationConfig

Author: Jyotinder Singh
Date created: 2025/12/18
Last modified: 2025/12/18
Description: Guide on using QuantizationConfig for weight-only quantization and custom quantizers.



Introduction

This guide explores the flexible QuantizationConfig API in Keras, introduced to give you granular control over how your models are quantized. While model.quantize("int8") provides a sensible default, you often need more control: for example, to perform weight-only quantization (common in LLMs) or to use a custom quantization scheme (such as percentile-based clipping).

We will cover:

  1. Customizing INT8 Quantization: Modifying the default parameters (e.g., custom value range).
  2. Weight-Only Quantization (INT4): Quantizing weights to 4-bit while keeping activations in float, using Int4QuantizationConfig.
  3. Custom Quantizers: Implementing a completely custom quantizer (e.g., PercentileQuantizer) and using it with QuantizationConfig.

Setup

import keras
import numpy as np
from keras import ops

rng = np.random.default_rng()


def get_model():
    """Builds a simple Sequential model for demonstration."""
    return keras.Sequential(
        [
            keras.Input(shape=(10,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1),
        ]
    )

1. Customizing INT8 Quantization

By default, model.quantize("int8") uses AbsMaxQuantizer for both weights and activations, with the default value range of [-127, 127]. You might want to specify different parameters, such as a restricted value range (if you expect your activations to fall within a narrower band). You can do this by creating an Int8QuantizationConfig.

from keras.quantizers import Int8QuantizationConfig, AbsMaxQuantizer

model = get_model()

# Create a custom config
# Here we restrict both the weight and activation ranges to [-100, 100]
# instead of the default [-127, 127].
custom_int8_config = Int8QuantizationConfig(
    weight_quantizer=AbsMaxQuantizer(value_range=(-100, 100), axis=0),
    activation_quantizer=AbsMaxQuantizer(value_range=(-100, 100), axis=-1),
)

# Apply quantization with the custom config
model.quantize(config=custom_int8_config)

print("Layer 0 kernel dtype:", model.layers[0].kernel.dtype)
# Ensure all kernel values are within the specified range
assert ops.all(
    ops.less_equal(model.layers[0].kernel, 100)
), "Kernel values are not <= 100"
assert ops.all(
    ops.greater_equal(model.layers[0].kernel, -100)
), "Kernel values are not >= -100"
Layer 0 kernel dtype: int8
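
As a quick sanity check, we can confirm that the quantized model still runs ordinary float inference (a minimal sketch using the rng generator from the setup section):

# Quantized layers still consume and produce float tensors; the int8
# arithmetic happens internally during the forward pass.
x = rng.uniform(size=(2, 10)).astype("float32")
print("Output shape:", model.predict(x, verbose=0).shape)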

2. Weight-Only Quantization (INT4)

By default, model.quantize("int4") quantizes weights to INT4 and activations to INT8. For large language models and memory-constrained environments, weight-only quantization is a popular technique: it significantly reduces model size by storing weights in 4-bit while keeping activations at higher precision.

To achieve this, we set activation_quantizer=None in the Int4QuantizationConfig.

from keras.quantizers import Int4QuantizationConfig

model = get_model()

# Define Int4 weight-only config
# We enable Int4 for weights, but disable activation quantization by setting it to None.
# Note that we use `"int8"` as the output dtype since TensorFlow and PyTorch don't support
# `int4`. However, we still benefit from the lower memory usage of int4 weights because of
# bitpacking implemented by Keras.
custom_int4_config = Int4QuantizationConfig(
    weight_quantizer=AbsMaxQuantizer(value_range=(-8, 7), output_dtype="int8", axis=0),
    activation_quantizer=None,
)

model.quantize(config=custom_int4_config)

# Verify that the weights are quantized (int4 values packed into an int8 buffer)
# and that no activation quantizer was attached.
print("Layer 0 kernel dtype:", model.layers[0].kernel.dtype)
print("Layer 0 has inputs_quantizer:", model.layers[0].inputs_quantizer is not None)
Layer 0 kernel dtype: <dtype: 'int8'>
Layer 0 has inputs_quantizer: False
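
Because activation quantization is disabled, the forward pass consumes regular float inputs; only the stored weights are low-precision. A minimal check, again using the setup's rng:

# With weight-only quantization, activations stay in float: the layer accepts
# and returns float tensors, while only the packed int4 weights are quantized.
x = rng.uniform(size=(4, 10)).astype("float32")
y = model(x)
print("Output dtype:", y.dtype)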

3. Custom Quantizers: Implementing a Percentile Quantizer

Sometimes, standard absolute-max quantization isn't enough. A single large outlier forces the abs-max scale down, wasting most of the quantized range on values that rarely occur; percentile-based quantization is more robust because it clips such outliers instead. Keras allows you to define your own quantizer by subclassing keras.quantizers.Quantizer.

Below is an implementation of a PercentileQuantizer that sets the scale based on a specified percentile of the absolute values.

from keras.quantizers import Quantizer
from keras import backend


class PercentileQuantizer(Quantizer):
    """Quantizes x using the percentile-based scale."""

    def __init__(
        self,
        percentile=99.9,
        value_range=(-127, 127),  # Default range for int8
        epsilon=backend.epsilon(),
        output_dtype="int8",  # Default dtype for int8
    ):
        super().__init__(output_dtype=output_dtype)
        self.percentile = percentile
        self.value_range = value_range
        self.epsilon = epsilon

    def __call__(self, x, axis, to_numpy=False):
        """Quantizes x using the percentile-based scale.

        `to_numpy` can be set to True to perform the computation on the host CPU,
        which saves device memory.
        """
        # 1. Compute the percentile value of absolute inputs
        x_abs = ops.abs(x)

        if to_numpy:
            x_np = ops.convert_to_numpy(x_abs)
            max_val = np.percentile(x_np, self.percentile, axis=axis, keepdims=True)
        else:
            max_val = ops.quantile(
                x_abs, self.percentile / 100, axis=axis, keepdims=True
            )

        # 2. Compute scale: scale = range_max / (max_val + epsilon)
        # Adding epsilon guards against division by zero.
        scale = ops.divide(self.value_range[1], ops.add(max_val, self.epsilon))
        if not to_numpy:
            scale = ops.cast(scale, backend.standardize_dtype(x.dtype))

        # 3. Quantize: q = clip(round(x * scale), range_min, range_max)
        outputs = ops.multiply(x, scale)
        outputs = ops.clip(ops.round(outputs), self.value_range[0], self.value_range[1])
        outputs = ops.cast(outputs, self.output_dtype)

        return outputs, scale

    def get_config(self):
        """Returns the config of the quantizer for serialization support."""
        return {
            "percentile": self.percentile,
            "value_range": self.value_range,
            "epsilon": self.epsilon,
            "output_dtype": self.output_dtype,
        }
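
Before wiring it into a config, we can sanity-check the quantizer in isolation. The sketch below uses synthetic data with a single injected outlier: the percentile-based scale stays close to the bulk of the distribution, whereas an abs-max scale must shrink to accommodate the outlier.

# Synthetic data: 1023 standard-normal values plus one large outlier.
x = ops.convert_to_tensor(
    np.append(rng.normal(size=1023), 100.0).astype("float32").reshape(1, -1)
)

_, percentile_scale = PercentileQuantizer(percentile=99.9)(x, axis=-1)
# For comparison, the scale abs-max would pick: range_max / max(|x|).
absmax_scale = ops.divide(127.0, ops.max(ops.abs(x)))

print("percentile scale:", float(ops.convert_to_numpy(percentile_scale).ravel()[0]))
print("abs-max scale:   ", float(ops.convert_to_numpy(absmax_scale)))

The percentile scale is much larger, preserving resolution for typical values at the cost of clipping the outlier to the edge of the int8 range.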

Now we can use this PercentileQuantizer in our configuration.

model = get_model()

# Use the custom quantizer for activations
custom_int8_config = Int8QuantizationConfig(
    weight_quantizer=AbsMaxQuantizer(axis=0),
    activation_quantizer=PercentileQuantizer(percentile=99.9),
)

model.quantize(config=custom_int8_config)

# Verify the integration
print(
    "Layer 0 uses custom activation quantizer:",
    isinstance(model.layers[0].inputs_quantizer, PercentileQuantizer),
)
Layer 0 uses custom activation quantizer: True
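
Because PercentileQuantizer implements get_config, it round-trips through Keras object serialization. A minimal sketch (the custom class must be supplied via custom_objects at deserialization time, unless you register it with keras.saving.register_keras_serializable):

from keras.saving import serialize_keras_object, deserialize_keras_object

config = serialize_keras_object(PercentileQuantizer(percentile=99.5))
restored = deserialize_keras_object(
    config, custom_objects={"PercentileQuantizer": PercentileQuantizer}
)
print("Restored percentile:", restored.percentile)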

Conclusion

With QuantizationConfig, you are no longer limited to stock quantization options. Whether you need weight-only quantization or custom quantizers for specialized hardware or research, Keras provides the modularity to build exactly what you need.