Gemma3nAudioConverter

[source]

Gemma3nAudioConverter class

keras_hub.layers.Gemma3nAudioConverter(
    feature_size,
    sampling_rate,
    padding_value,
    return_attention_mask,
    frame_length_ms,
    hop_length_ms,
    min_frequency,
    max_frequency,
    preemphasis,
    preemphasis_htk_flavor,
    fft_overdrive,
    dither,
    input_scale_factor,
    mel_floor,
    per_bin_mean,
    per_bin_stddev,
    padding_side,
    **kwargs
)

Converts raw audio waveforms into log-mel spectrograms.

This layer preprocesses 1D audio signals into 2D log-mel spectrograms suitable for the Gemma3n audio encoder. The conversion proceeds in the following steps:

  • Pad or truncate the raw audio to a consistent length.
  • Optionally apply dithering, input scaling, and preemphasis.
  • Compute the Short-Time Fourier Transform (STFT) with a Hann window.
  • Convert the magnitude spectrogram to the mel scale using a mel filterbank.
  • Take the logarithm to obtain the log-mel spectrogram.
  • Optionally normalize the features using provided per-bin mean and standard deviation statistics.

The layer returns both the spectrogram and an attention mask indicating which frames are valid.
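The overall pipeline can be approximated in plain NumPy. The following is a minimal sketch of the steps above, not the layer's implementation; the random filterbank is a stand-in for a real mel filterbank spanning min_frequency to max_frequency:

import numpy as np

# Convert the millisecond settings to sample counts (16 kHz example).
sampling_rate = 16000
frame_length = int(sampling_rate * 32.0 / 1000)  # frame_length_ms -> 512
hop_length = int(sampling_rate * 10.0 / 1000)    # hop_length_ms -> 160
fft_length = 2 * 2 ** int(np.ceil(np.log2(frame_length)))  # fft_overdrive doubles it

audio = np.sin(2 * np.pi * 440 * np.arange(sampling_rate) / sampling_rate)
audio = audio.astype("float32") * 1.0  # input_scale_factor

# Frame the signal and apply a simple (non-HTK) preemphasis filter.
num_frames = 1 + (len(audio) - frame_length) // hop_length
frames = np.stack(
    [audio[i * hop_length : i * hop_length + frame_length] for i in range(num_frames)]
)
frames[:, 1:] = frames[:, 1:] - 0.97 * frames[:, :-1]  # preemphasis

# Hann window, STFT magnitude, mel projection, then log with a floor.
window = np.hanning(frame_length)
spectrum = np.abs(np.fft.rfft(frames * window, n=fft_length))
mel_weights = np.random.rand(fft_length // 2 + 1, 128)  # stand-in filterbank
log_mel = np.log(np.maximum(spectrum @ mel_weights, 1e-5))  # mel_floor
print(log_mel.shape)  # (97, 128): (num_frames, feature_size)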

Arguments

  • feature_size: int. The number of mel bins to generate.
  • sampling_rate: int. The expected sampling rate of the input audio.
  • padding_value: float. The value to use for padding the raw audio.
  • return_attention_mask: bool. Whether to return an attention mask.
  • frame_length_ms: float. The length of each STFT frame in milliseconds.
  • hop_length_ms: float. The step size between STFT frames in milliseconds.
  • min_frequency: float. The lowest frequency for the mel filterbank.
  • max_frequency: float. The highest frequency for the mel filterbank.
  • preemphasis: float. The coefficient for the preemphasis filter. Set to 0.0 to disable.
  • preemphasis_htk_flavor: bool. Whether to use the HTK-style preemphasis.
  • fft_overdrive: bool. If True, doubles the FFT length.
  • dither: float. Amount of dithering to add to the waveform. Set to 0.0 to disable.
  • input_scale_factor: float. Factor to scale the input waveform by.
  • mel_floor: float. A minimum value (floor) to apply before taking the logarithm.
  • per_bin_mean: list or None. A list of mean values for each mel bin, used for normalization (see the sketch after this list).
  • per_bin_stddev: list or None. A list of standard deviation values for each mel bin, used for normalization.
  • padding_side: str. Which side to pad the audio on ('right' or 'left').
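When per_bin_mean and per_bin_stddev are provided, the layer normalizes each mel bin with those statistics. A minimal sketch of that step, assuming it is a plain per-bin standardization (the dummy features and values below are illustrative):

import numpy as np

# Assumed per-bin normalization: standardize each mel bin independently.
features = np.random.rand(97, 128).astype("float32")  # (num_frames, mel bins)
per_bin_mean = np.full(128, 0.5, dtype="float32")
per_bin_stddev = np.full(128, 0.1, dtype="float32")
normalized = (features - per_bin_mean) / per_bin_stddev
print(normalized.shape)  # (97, 128)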

Call arguments

  • raw_speech: A raw audio waveform tensor, list of waveforms, or numpy array. Can be unbatched (1D) or batched (list of 1D arrays).
  • padding: str or bool. Padding strategy for batches. Options are "longest" (pad to longest sequence in batch), True (same as "longest"), or False (no padding). Defaults to "longest".
  • max_length: int. Maximum length to truncate or pad to. Defaults to 480000.
  • truncation: bool. Whether to truncate sequences longer than max_length. Defaults to True.
  • pad_to_multiple_of: int or None. If set, pad the sequence length to a multiple of this value. Defaults to 128.
  • return_attention_mask: bool. Whether to return an attention mask indicating valid (non-padded) frames. Defaults to True.

Examples

import keras_hub
import numpy as np

# Create a simple audio signal (1 second of 440 Hz sine wave).
audio = np.sin(
    2 * np.pi * 440 * np.linspace(0, 1, 16000, dtype=np.float32)
)

# Initialize the audio converter.
converter = keras_hub.layers.Gemma3nAudioConverter(
    feature_size=128,
    sampling_rate=16000,
    padding_value=0.0,
    return_attention_mask=True,
    frame_length_ms=32.0,
    hop_length_ms=10.0,
    min_frequency=125.0,
    max_frequency=7600.0,
    preemphasis=0.97,
    preemphasis_htk_flavor=True,
    fft_overdrive=True,
    dither=0.0,
    input_scale_factor=1.0,
    mel_floor=1e-5,
    per_bin_mean=None,
    per_bin_stddev=None,
    padding_side="right",
)

# Convert audio to log-mel spectrogram.
features, mask = converter(audio)
print(features.shape)  # (num_frames, 128)
print(mask.shape)      # (num_frames,)

# Convert a batch of audio with padding.
audio_1 = np.sin(
    2 * np.pi * 440 * np.linspace(0, 1, 16000, dtype=np.float32)
)
audio_2 = np.sin(
    2 * np.pi * 880 * np.linspace(0, 0.5, 8000, dtype=np.float32)
)
features, mask = converter(
    [audio_1, audio_2],
    padding="longest",
    pad_to_multiple_of=128,
)
print(features.shape)  # (2, num_frames, 128)
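
The call arguments above can also be combined; for example, a long waveform can be truncated to a fixed sample budget before featurization. A short sketch using only the documented call arguments:

# Truncate a 2-second waveform to at most 1 second of samples.
long_audio = np.random.randn(32000).astype("float32")
features, mask = converter(
    long_audio,
    max_length=16000,
    truncation=True,
)
print(features.shape)  # (num_frames, 128), computed from 16000 samples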

[source]

from_preset method

Gemma3nAudioConverter.from_preset(preset, **kwargs)

Instantiate a keras_hub.layers.AudioConverter from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'whisper_base_en'
  2. a Kaggle Models handle like 'kaggle://user/whisper/keras/whisper_base_en'
  3. a Hugging Face handle like 'hf://user/whisper_base_en'
  4. a path to a local preset directory like './whisper_base_en'

You can run cls.presets.keys() to list all built-in presets available on the class.

This constructor can be called in one of two ways. Either from the base class like keras_hub.layers.AudioConverter.from_preset(), or from a model class like keras_hub.layers.WhisperAudioConverter.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

import keras_hub
import numpy as np

# Load an audio converter from a preset.
converter = keras_hub.layers.AudioConverter.from_preset(
    "whisper_base_en"
)
# Convert some raw mono channel audio input.
converter(np.ones((2, 1_000)))
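
For this converter specifically, you can list the built-in presets on the class and load one by name (preset names are listed in the table below):

# List all built-in presets available on the class.
print(keras_hub.layers.Gemma3nAudioConverter.presets.keys())

# Load the audio converter for a Gemma 3n preset.
converter = keras_hub.layers.Gemma3nAudioConverter.from_preset(
    "gemma3n_e2b"
)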
Preset          Parameters  Description
gemma3n_e2b     5.44B       Gemma 3n E2B multimodal model (~5B total, ~2B effective parameters) supporting multimodal inputs and optimized for on-device deployment.
gemma3n_e2b_it  5.44B       Instruction-tuned Gemma 3n E2B multimodal model (~5B total, ~2B effective parameters) supporting multimodal inputs and optimized for on-device deployment.
gemma3n_e4b     7.85B       Gemma 3n E4B multimodal model (~8B total, ~4B effective parameters) supporting multimodal inputs and optimized for on-device deployment.
gemma3n_e4b_it  7.85B       Instruction-tuned Gemma 3n E4B multimodal model (~8B total, ~4B effective parameters) supporting multimodal inputs and optimized for on-device deployment.