Gemma3nAudioConverter

[source]

Gemma3nAudioConverter class

keras_hub.layers.Gemma3nAudioConverter(
    feature_size,
    sampling_rate,
    padding_value,
    return_attention_mask,
    frame_length_ms,
    hop_length_ms,
    min_frequency,
    max_frequency,
    preemphasis,
    preemphasis_htk_flavor,
    fft_overdrive,
    dither,
    input_scale_factor,
    mel_floor,
    per_bin_mean,
    per_bin_stddev,
    padding_side,
    **kwargs
)

Converts raw audio waveforms into log-mel spectrograms.

This layer preprocesses 1D audio signals into 2D log-mel spectrograms suitable for the Gemma3n audio encoder. The conversion proceeds in the following steps:

  • Pad or truncate the raw audio to a consistent length.
  • Optionally apply dithering, input scaling, and preemphasis.
  • Compute the Short-Time Fourier Transform (STFT) with a Hann window.
  • Convert the magnitude spectrogram to the mel scale using a mel filterbank.
  • Take the logarithm to obtain the log-mel spectrogram.
  • Optionally normalize the features using provided per-bin mean and standard deviation statistics.

The layer returns both the spectrogram and an attention mask indicating which frames are valid.
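The overall pipeline can be approximated in plain NumPy. The following is a minimal sketch of the steps above, not the layer's implementation; the random filterbank is a stand-in for a real mel filterbank spanning min_frequency to max_frequency:

import numpy as np

# Convert the millisecond settings to sample counts (16 kHz example).
sampling_rate = 16000
frame_length = int(sampling_rate * 32.0 / 1000)  # frame_length_ms -> 512
hop_length = int(sampling_rate * 10.0 / 1000)    # hop_length_ms -> 160
fft_length = 2 * 2 ** int(np.ceil(np.log2(frame_length)))  # fft_overdrive doubles it

audio = np.sin(2 * np.pi * 440 * np.arange(sampling_rate) / sampling_rate)
audio = audio.astype("float32") * 1.0  # input_scale_factor

# Frame the signal and apply a simple (non-HTK) preemphasis filter.
num_frames = 1 + (len(audio) - frame_length) // hop_length
frames = np.stack(
    [audio[i * hop_length : i * hop_length + frame_length] for i in range(num_frames)]
)
frames[:, 1:] = frames[:, 1:] - 0.97 * frames[:, :-1]  # preemphasis

# Hann window, STFT magnitude, mel projection, then log with a floor.
window = np.hanning(frame_length)
spectrum = np.abs(np.fft.rfft(frames * window, n=fft_length))
mel_weights = np.random.rand(fft_length // 2 + 1, 128)  # stand-in filterbank
log_mel = np.log(np.maximum(spectrum @ mel_weights, 1e-5))  # mel_floor
print(log_mel.shape)  # (97, 128): (num_frames, feature_size)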

Arguments

  • feature_size: int. The number of mel bins to generate.
  • sampling_rate: int. The expected sampling rate of the input audio.
  • padding_value: float. The value to use for padding the raw audio.
  • return_attention_mask: bool. Whether to return an attention mask.
  • frame_length_ms: float. The length of each STFT frame in milliseconds.
  • hop_length_ms: float. The step size between STFT frames in milliseconds.
  • min_frequency: float. The lowest frequency for the mel filterbank.
  • max_frequency: float. The highest frequency for the mel filterbank.
  • preemphasis: float. The coefficient for the preemphasis filter. Set to 0.0 to disable.
  • preemphasis_htk_flavor: bool. Whether to use the HTK-style preemphasis.
  • fft_overdrive: bool. If True, doubles the FFT length.
  • dither: float. Amount of dithering to add to the waveform. Set to 0.0 to disable.
  • input_scale_factor: float. Factor to scale the input waveform by.
  • mel_floor: float. A minimum value (floor) to apply before taking the logarithm.
  • per_bin_mean: list or None. A list of mean values for each mel bin, used for normalization (see the sketch after this list).
  • per_bin_stddev: list or None. A list of standard deviation values for each mel bin, used for normalization.
  • padding_side: str. Which side to pad the audio on ('right' or 'left').
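When per_bin_mean and per_bin_stddev are provided, the layer normalizes each mel bin with those statistics. A minimal sketch of that step, assuming it is a plain per-bin standardization (the dummy features and values below are illustrative):

import numpy as np

# Assumed per-bin normalization: standardize each mel bin independently.
features = np.random.rand(97, 128).astype("float32")  # (num_frames, mel bins)
per_bin_mean = np.full(128, 0.5, dtype="float32")
per_bin_stddev = np.full(128, 0.1, dtype="float32")
normalized = (features - per_bin_mean) / per_bin_stddev
print(normalized.shape)  # (97, 128)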

Call arguments

  • raw_speech: A raw audio waveform tensor, list of waveforms, or numpy array. Can be unbatched (1D) or batched (list of 1D arrays).
  • padding: str or bool. Padding strategy for batches. Options are "longest" (pad to longest sequence in batch), True (same as "longest"), or False (no padding). Defaults to "longest".
  • max_length: int. Maximum length to truncate or pad to. Defaults to 480000.
  • truncation: bool. Whether to truncate sequences longer than max_length. Defaults to True.
  • pad_to_multiple_of: int or None. If set, pad the sequence length to a multiple of this value. Defaults to 128.
  • return_attention_mask: bool. Whether to return an attention mask indicating valid (non-padded) frames. Defaults to True.

Examples

import keras_hub
import numpy as np

# Create a simple audio signal (1 second of 440 Hz sine wave).
audio = np.sin(
    2 * np.pi * 440 * np.linspace(0, 1, 16000, dtype=np.float32)
)

# Initialize the audio converter.
converter = keras_hub.layers.Gemma3nAudioConverter(
    feature_size=128,
    sampling_rate=16000,
    padding_value=0.0,
    return_attention_mask=True,
    frame_length_ms=32.0,
    hop_length_ms=10.0,
    min_frequency=125.0,
    max_frequency=7600.0,
    preemphasis=0.97,
    preemphasis_htk_flavor=True,
    fft_overdrive=True,
    dither=0.0,
    input_scale_factor=1.0,
    mel_floor=1e-5,
    per_bin_mean=None,
    per_bin_stddev=None,
    padding_side="right",
)

# Convert audio to log-mel spectrogram.
features, mask = converter(audio)
print(features.shape)  # (num_frames, 128)
print(mask.shape)      # (num_frames,)

# Convert a batch of audio with padding.
audio_1 = np.sin(
    2 * np.pi * 440 * np.linspace(0, 1, 16000, dtype=np.float32)
)
audio_2 = np.sin(
    2 * np.pi * 880 * np.linspace(0, 0.5, 8000, dtype=np.float32)
)
features, mask = converter(
    [audio_1, audio_2],
    padding="longest",
    pad_to_multiple_of=128,
)
print(features.shape)  # (2, num_frames, 128)
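
The call arguments above can also be combined; for example, a long waveform can be truncated to a fixed sample budget before featurization. A short sketch using only the documented call arguments:

# Truncate a 2-second waveform to at most 1 second of samples.
long_audio = np.random.randn(32000).astype("float32")
features, mask = converter(
    long_audio,
    max_length=16000,
    truncation=True,
)
print(features.shape)  # (num_frames, 128), computed from 16000 samples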

[source]

from_preset method

Gemma3nAudioConverter.from_preset(preset, **kwargs)

Instantiate a keras_hub.layers.AudioConverter from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'whisper_base_en'
  2. a Kaggle Models handle like 'kaggle://user/whisper/keras/whisper_base_en'
  3. a Hugging Face handle like 'hf://user/whisper_base_en'
  4. a path to a local preset directory like './whisper_base_en'

You can run cls.presets.keys() to list all built-in presets available on the class.

This constructor can be called in one of two ways. Either from the base class like keras_hub.layers.AudioConverter.from_preset(), or from a model class like keras_hub.layers.WhisperAudioConverter.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

import keras_hub
import numpy as np

# Load an audio converter from a preset.
converter = keras_hub.layers.AudioConverter.from_preset(
    "whisper_base_en"
)
# Convert some raw mono channel audio input.
converter(np.ones((2, 1_000)))
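
For this converter specifically, you can list the built-in presets on the class and load one by name (preset names are listed in the table below):

# List all built-in presets available on the class.
print(keras_hub.layers.Gemma3nAudioConverter.presets.keys())

# Load the audio converter for a Gemma 3n preset.
converter = keras_hub.layers.Gemma3nAudioConverter.from_preset(
    "gemma3n_e2b"
)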
Preset          Parameters  Description
gemma3n_e2b     5.44B       Gemma 3n E2B multimodal model (~5B total, ~2B effective parameters) supporting multimodal inputs and optimized for on-device deployment.
gemma3n_e2b_it  5.44B       Instruction-tuned Gemma 3n E2B multimodal model (~5B total, ~2B effective parameters) supporting multimodal inputs and optimized for on-device deployment.
gemma3n_e4b     7.85B       Gemma 3n E4B multimodal model (~8B total, ~4B effective parameters) supporting multimodal inputs and optimized for on-device deployment.
gemma3n_e4b_it  7.85B       Instruction-tuned Gemma 3n E4B multimodal model (~8B total, ~4B effective parameters) supporting multimodal inputs and optimized for on-device deployment.