AudioConverter layer

AudioConverter class

keras_hub.layers.AudioConverter(**kwargs)

Convert raw audio for models that support audio input.

This class converts raw audio tensors of any length into preprocessed audio for pretrained model inputs. It is meant to be a convenient way to write custom preprocessing code that is not model-specific. This layer should be instantiated via the from_preset() constructor, which will create the correct subclass of this layer for the model preset.

The layer takes as input a raw audio tensor with shape (batch_size, num_samples) and outputs a preprocessed audio input for modeling. The exact structure of the preprocessed input varies per model. Preprocessing will often include computing a spectrogram of the raw audio signal.
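To make the spectrogram step concrete, here is a minimal NumPy sketch of a magnitude spectrogram computed over a batch shaped (batch_size, num_samples). It is an illustration only, not the code any AudioConverter subclass runs: the function name magnitude_spectrogram, the Hann window, and the frame_length and frame_step defaults are assumptions chosen for the example.

```python
import numpy as np


def magnitude_spectrogram(audio, frame_length=400, frame_step=160):
    """Return |STFT| with shape (batch_size, num_frames, frame_length // 2 + 1).

    `audio` has shape (batch_size, num_samples). The frame sizes are
    illustrative defaults, not values read from any preset.
    """
    audio = np.asarray(audio, dtype="float32")
    batch_size, num_samples = audio.shape
    if num_samples < frame_length:
        # Zero-pad short clips on the right so at least one frame fits.
        audio = np.pad(audio, ((0, 0), (0, frame_length - num_samples)))
        num_samples = frame_length
    num_frames = 1 + (num_samples - frame_length) // frame_step
    # Index matrix of shape (num_frames, frame_length): one row per frame.
    starts = frame_step * np.arange(num_frames)[:, None]
    index = starts + np.arange(frame_length)[None, :]
    frames = audio[:, index]  # (batch_size, num_frames, frame_length)
    window = np.hanning(frame_length)
    return np.abs(np.fft.rfft(frames * window, axis=-1))


# A batch of two clips, each 1,000 samples long.
batch = np.ones((2, 1_000), dtype="float32")
spec = magnitude_spectrogram(batch)
print(spec.shape)  # (2, 4, 201)
```

Real converters may apply further steps after this one, and the details differ per model, so treat the output shape here as specific to the sketch.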

Examples

import numpy as np
import keras_hub

# Load an audio converter from a preset.
converter = keras_hub.layers.AudioConverter.from_preset("whisper_base_en")
# Convert a batch of two raw audio clips, each 1,000 samples long.
converter(np.ones((2, 1_000)))

from_preset method

AudioConverter.from_preset(preset, **kwargs)

Instantiate a keras_hub.layers.AudioConverter from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'whisper_base_en'
  2. a Kaggle Models handle like 'kaggle://user/whisper/keras/whisper_base_en'
  3. a Hugging Face handle like 'hf://user/whisper_base_en'
  4. a path to a local preset directory like './whisper_base_en'
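The four forms above can be told apart by their prefix. The following sketch shows one way to do that. preset_kind is a hypothetical helper, not part of keras_hub, and the rule it uses for spotting local paths is an assumption rather than the library's actual resolution logic.

```python
def preset_kind(preset: str) -> str:
    """Classify a preset string into one of the four documented forms.

    Hypothetical helper for illustration only. It does not check that the
    preset exists, and keras_hub may resolve presets differently.
    """
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "hugging_face"
    # Assumption: anything that looks like a filesystem path is local.
    if preset.startswith(("./", "../", "/", "~")):
        return "local_path"
    # Everything else is treated as a built-in preset identifier.
    return "built_in"


for example in (
    "whisper_base_en",
    "kaggle://user/whisper/keras/whisper_base_en",
    "hf://user/whisper_base_en",
    "./whisper_base_en",
):
    print(example, "->", preset_kind(example))
```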

You can run cls.presets.keys() to list all built-in presets available on the class.

This constructor can be called in one of two ways: either from the base class, like keras_hub.layers.AudioConverter.from_preset(), or from a model-specific class, like keras_hub.layers.WhisperAudioConverter.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.
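The two calling styles can be compared side by side. This is a sketch under two assumptions: that both classes are exposed under keras_hub.layers, matching the class signature at the top of this page, and that the "whisper_base_en" preset is reachable from your environment. The function name load_converter_both_ways is hypothetical.

```python
def load_converter_both_ways(preset="whisper_base_en"):
    """Load the same preset via the base class and via the Whisper subclass."""
    # Imported lazily so that defining this function does not require
    # keras_hub; calling it does, and it will fetch the preset files.
    import keras_hub

    # Base class: the concrete subclass is inferred from the preset config.
    via_base = keras_hub.layers.AudioConverter.from_preset(preset)
    # Model-specific class: the subclass is stated explicitly.
    via_subclass = keras_hub.layers.WhisperAudioConverter.from_preset(preset)
    return via_base, via_subclass
```

If the base class resolves to the Whisper subclass as described, the two returned objects should have the same type. The base-class form is handy when the preset name comes from configuration and the model family is not known in advance.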

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

import numpy as np
import keras_hub

# Load an audio converter from a preset.
converter = keras_hub.layers.AudioConverter.from_preset(
    "whisper_base_en"
)
# Convert a batch of two raw mono-channel audio clips.
converter(np.ones((2, 1_000)))

| Preset | Parameters | Description |
| --- | --- | --- |
| gemma4_2b | 5.10B | Gemma 4 E2B base model: 2.3B effective parameters (5.1B total with Per-Layer Embeddings), 35-layer, audio+vision+text pretrained Gemma4 model. The 'E' denotes effective parameters: PLE gives each decoder layer its own token embedding table, maximizing parameter efficiency for on-device deployment. |
| gemma4_instruct_2b | 5.10B | Gemma 4 E2B instruction-tuned model: 2.3B effective parameters (5.1B total with Per-Layer Embeddings), 35-layer, audio+vision+text instruction-tuned Gemma4 model. The 'E' denotes effective parameters: PLE gives each decoder layer its own token embedding table, maximizing parameter efficiency for on-device deployment. |
| gemma4_4b | 7.90B | Gemma 4 E4B base model: 4.5B effective parameters (7.9B total with Per-Layer Embeddings), 42-layer, audio+vision+text pretrained Gemma4 model. The 'E' denotes effective parameters: PLE gives each decoder layer its own token embedding table, maximizing parameter efficiency for on-device deployment. |
| gemma4_instruct_4b | 7.90B | Gemma 4 E4B instruction-tuned model: 4.5B effective parameters (7.9B total with Per-Layer Embeddings), 42-layer, audio+vision+text instruction-tuned Gemma4 model. The 'E' denotes effective parameters: PLE gives each decoder layer its own token embedding table, maximizing parameter efficiency for on-device deployment. |
| gemma4_26b_a4b | 26.00B | Gemma 4 26B A4B base model: Mixture-of-Experts (MoE) model with 26B total parameters and only 4B active parameters per forward pass, 30-layer, vision+text pretrained Gemma4 model. The 'A' denotes active parameters: by activating only a 4B subset during inference, this MoE model runs nearly as fast as a dense 4B model. |
| gemma4_instruct_26b_a4b | 26.00B | Gemma 4 26B A4B instruction-tuned model: Mixture-of-Experts (MoE) model with 26B total parameters and only 4B active parameters per forward pass, 30-layer, vision+text instruction-tuned Gemma4 model. The 'A' denotes active parameters: by activating only a 4B subset during inference, this MoE model runs nearly as fast as a dense 4B model. |
| gemma4_31b | 31.00B | Gemma 4 31B base model: 31B parameter, 60-layer, dense vision+text pretrained Gemma4 model. The dense model in the Gemma 4 family, offering maximum quality for deployments where inference speed is less of a constraint. |
| gemma4_instruct_31b | 31.00B | Gemma 4 31B instruction-tuned model: 31B parameter, 60-layer, dense vision+text instruction-tuned Gemma4 model. The dense model in the Gemma 4 family, offering maximum quality for deployments where inference speed is less of a constraint. |
| moonshine_tiny_en | 27.09M | Moonshine tiny model for English speech recognition. Developed by Useful Sensors for real-time transcription. |
| moonshine_base_en | 61.51M | Moonshine base model for English speech recognition. Developed by Useful Sensors for real-time transcription. |
| whisper_tiny_en | 37.18M | 4-layer Whisper model. Trained on 438,000 hours of labelled English speech data. |
| whisper_tiny_multi | 37.76M | 4-layer Whisper model. Trained on 680,000 hours of labelled multilingual speech data. |
| whisper_base_multi | 72.59M | 6-layer Whisper model. Trained on 680,000 hours of labelled multilingual speech data. |
| whisper_base_en | 124.44M | 6-layer Whisper model. Trained on 438,000 hours of labelled English speech data. |
| whisper_small_en | 241.73M | 12-layer Whisper model. Trained on 438,000 hours of labelled English speech data. |
| whisper_small_multi | 241.73M | 12-layer Whisper model. Trained on 680,000 hours of labelled multilingual speech data. |
| whisper_medium_en | 763.86M | 24-layer Whisper model. Trained on 438,000 hours of labelled English speech data. |
| whisper_medium_multi | 763.86M | 24-layer Whisper model. Trained on 680,000 hours of labelled multilingual speech data. |
| whisper_large_multi | 1.54B | 32-layer Whisper model. Trained on 680,000 hours of labelled multilingual speech data. |
| whisper_large_multi_v2 | 1.54B | 32-layer Whisper model. Trained for 2.5 epochs on 680,000 hours of labelled multilingual speech data. An improved version of whisper_large_multi. |