Gemma3nAudioConverter class
keras_hub.layers.Gemma3nAudioConverter(
feature_size,
sampling_rate,
padding_value,
return_attention_mask,
frame_length_ms,
hop_length_ms,
min_frequency,
max_frequency,
preemphasis,
preemphasis_htk_flavor,
fft_overdrive,
dither,
input_scale_factor,
mel_floor,
per_bin_mean,
per_bin_stddev,
padding_side,
**kwargs
)
Converts raw audio waveforms into log-mel spectrograms.
This layer preprocesses 1D audio signals into 2D log-mel spectrograms suitable for the Gemma3n audio encoder. The conversion process involves padding or truncating the raw audio to a consistent length, applying optional dithering, input scaling, and preemphasis, and then computing the Short-Time Fourier Transform (STFT) with a Hann window. The resulting magnitude spectrogram is converted to the mel scale using a mel filterbank, after which the log-mel spectrogram is calculated by taking the logarithm. Finally, the layer can optionally normalize these features using provided per-bin mean and standard deviation statistics, and it returns both the spectrogram and an attention mask indicating which frames are valid.
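The pipeline described above can be approximated in plain NumPy for intuition. The sketch below is illustrative only: the mel filterbank construction, preemphasis handling, and FFT length are simplified assumptions and will not exactly reproduce the layer's output.
import numpy as np

def hertz_to_mel(freq):
    return 2595.0 * np.log10(1.0 + freq / 700.0)

def mel_filterbank(num_bins, fft_length, sampling_rate, fmin, fmax):
    # Triangular filters with centers evenly spaced on the mel scale.
    mels = np.linspace(hertz_to_mel(fmin), hertz_to_mel(fmax), num_bins + 2)
    hz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    bins = np.floor((fft_length + 1) * hz / sampling_rate).astype(int)
    fb = np.zeros((fft_length // 2 + 1, num_bins))
    for i in range(num_bins):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[k, i] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[k, i] = (right - k) / max(right - center, 1)
    return fb

sampling_rate = 16000
frame_length = int(0.032 * sampling_rate)  # 32 ms frames -> 512 samples
hop_length = int(0.010 * sampling_rate)    # 10 ms hop -> 160 samples
fft_length = 512
audio = np.sin(
    2 * np.pi * 440 * np.linspace(0, 1, sampling_rate, dtype=np.float32)
)
# Preemphasis boosts high frequencies before the STFT.
audio = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
# Frame the signal and apply a Hann window.
num_frames = 1 + (len(audio) - frame_length) // hop_length
idx = (
    np.arange(frame_length)[None, :]
    + hop_length * np.arange(num_frames)[:, None]
)
frames = audio[idx] * np.hanning(frame_length)
# Magnitude spectrogram, mel projection, floor, then log.
spectrogram = np.abs(np.fft.rfft(frames, n=fft_length))
mel = spectrogram @ mel_filterbank(128, fft_length, sampling_rate, 125.0, 7600.0)
log_mel = np.log(np.maximum(mel, 1e-5))
print(log_mel.shape)  # (97, 128)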
Arguments
feature_size: int. The number of mel-frequency bins in the output features.
sampling_rate: int. The sampling rate of the input audio, in Hz.
padding_value: float. The value used when padding the raw audio.
return_attention_mask: bool. Whether calls return an attention mask marking valid frames along with the features.
frame_length_ms: float. The STFT frame length, in milliseconds.
hop_length_ms: float. The hop between successive STFT frames, in milliseconds.
min_frequency: float. The lowest frequency, in Hz, covered by the mel filterbank.
max_frequency: float. The highest frequency, in Hz, covered by the mel filterbank.
preemphasis: float. The preemphasis coefficient applied to the audio before the STFT.
preemphasis_htk_flavor: bool. Whether to apply the preemphasis in the HTK style.
fft_overdrive: bool. Whether to enable FFT overdrive when computing the STFT.
dither: float. The amount of random noise added to the audio before feature extraction; 0.0 disables dithering.
input_scale_factor: float. A scale factor applied to the raw audio.
mel_floor: float. The minimum value applied to the mel spectrogram before taking the logarithm.
per_bin_mean: Optional per-bin mean values used to normalize the log-mel features.
per_bin_stddev: Optional per-bin standard deviation values used to normalize the log-mel features.
padding_side: str. The side on which to pad the audio, "right" or "left".
Call arguments
"longest" (pad to longest sequence in batch), True (same as
"longest"), or False (no padding). Defaults to "longest".max_length. Defaults to True.True.Examples
import numpy as np
# Create a simple audio signal (1 second of 440 Hz sine wave).
audio = np.sin(
2 * np.pi * 440 * np.linspace(0, 1, 16000, dtype=np.float32)
)
# Initialize the audio converter
converter = keras_hub.layers.Gemma3nAudioConverter(
feature_size=128,
sampling_rate=16000,
padding_value=0.0,
return_attention_mask=True,
frame_length_ms=32.0,
hop_length_ms=10.0,
min_frequency=125.0,
max_frequency=7600.0,
preemphasis=0.97,
preemphasis_htk_flavor=True,
fft_overdrive=True,
dither=0.0,
input_scale_factor=1.0,
mel_floor=1e-5,
per_bin_mean=None,
per_bin_stddev=None,
padding_side="right",
)
# Convert audio to log-mel spectrogram.
features, mask = converter(audio)
print(features.shape) # (num_frames, 128)
print(mask.shape) # (num_frames,)
# Convert a batch of audio with padding.
audio_1 = np.sin(
2 * np.pi * 440 * np.linspace(0, 1, 16000, dtype=np.float32)
)
audio_2 = np.sin(
2 * np.pi * 880 * np.linspace(0, 0.5, 8000, dtype=np.float32)
)
features, mask = converter(
[audio_1, audio_2],
padding="longest",
pad_to_multiple_of=128,
)
print(features.shape) # (2, num_frames, 128)
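If normalization statistics are available, they can be passed at construction and the converter will standardize each mel bin of the log-mel features. The values below are illustrative placeholders (zero means and unit standard deviations, i.e. a no-op), and this sketch assumes per_bin_mean and per_bin_stddev accept length-feature_size sequences.
# Normalize features with per-bin statistics.
normalizing_converter = keras_hub.layers.Gemma3nAudioConverter(
    feature_size=128,
    sampling_rate=16000,
    padding_value=0.0,
    return_attention_mask=True,
    frame_length_ms=32.0,
    hop_length_ms=10.0,
    min_frequency=125.0,
    max_frequency=7600.0,
    preemphasis=0.97,
    preemphasis_htk_flavor=True,
    fft_overdrive=True,
    dither=0.0,
    input_scale_factor=1.0,
    mel_floor=1e-5,
    per_bin_mean=[0.0] * 128,  # placeholder statistics, not from a real corpus
    per_bin_stddev=[1.0] * 128,
    padding_side="right",
)
features, mask = normalizing_converter(audio_1)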
from_preset method
Gemma3nAudioConverter.from_preset(preset, **kwargs)
Instantiate a keras_hub.layers.AudioConverter from a model preset.
A preset is a directory of configs, weights and other file assets used
to save and load a pre-trained model. The preset can be passed as
one of:
1. a built-in preset identifier like 'whisper_base_en'
2. a Kaggle Models handle like 'kaggle://user/whisper/keras/whisper_base_en'
3. a Hugging Face handle like 'hf://user/whisper_base_en'
4. a path to a local preset directory like './whisper_base_en'
You can run cls.presets.keys() to list all built-in presets available
on the class.
This constructor can be called in one of two ways. Either from the base
class like keras_hub.layers.AudioConverter.from_preset(), or from a
model class like keras_hub.layers.WhisperAudioConverter.from_preset().
If calling from the base class, the subclass of the returned object
will be inferred from the config in the preset directory.
Arguments
preset: string. A built-in preset identifier, a Kaggle Models handle,
a Hugging Face handle, or a path to a local directory.
load_weights: bool. If True, the weights will be loaded into the
model architecture. If False, the weights will be randomly
initialized.
Examples
import numpy as np

# Load an audio converter from a preset.
converter = keras_hub.layers.AudioConverter.from_preset(
    "whisper_base_en"
)
# Convert some raw mono channel audio input.
converter(np.ones((2, 1_000)))
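The converter for this page's Gemma3n presets (listed below) can be loaded the same way from the model-specific class. Whether a mask is returned follows the preset's saved configuration, so the output is captured in a single variable here.
# Load the Gemma3n audio converter from one of the presets listed below.
converter = keras_hub.layers.Gemma3nAudioConverter.from_preset(
    "gemma3n_e2b"
)
# Convert one second of mono audio at 16 kHz.
outputs = converter(np.ones(16_000, dtype="float32"))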
| Preset | Parameters | Description |
|---|---|---|
| gemma3n_e2b | 5.44B | Gemma 3n E2B multimodal model (~5B total, ~2B effective parameters) supporting multimodal inputs and optimized for on-device deployment. |
| gemma3n_e2b_it | 5.44B | Instruction-tuned Gemma 3n E2B multimodal model (~5B total, ~2B effective parameters) supporting multimodal inputs and optimized for on-device deployment. |
| gemma3n_e4b | 7.85B | Gemma 3n E4B multimodal model (~8B total, ~4B effective parameters) supporting multimodal inputs and optimized for on-device deployment. |
| gemma3n_e4b_it | 7.85B | Instruction-tuned Gemma 3n E4B multimodal model (~8B total, ~4B effective parameters) supporting multimodal inputs and optimized for on-device deployment. |