Gemma4AudioEncoder class
keras_hub.models.Gemma4AudioEncoder(
input_feat_size=128,
hidden_size=1024,
num_heads=8,
num_layers=12,
chunk_size=12,
context_left=13,
context_right=0,
logit_cap=50.0,
invalid_logit_value=-1000000000.0,
conv_kernel_size=5,
reduction_factor=1,
residual_weight=0.5,
gradient_clipping=10000000000.0,
sscp_conv_channels=(128, 32),
sscp_kernel_sizes=((3, 3), (3, 3)),
sscp_stride_sizes=((2, 2), (2, 2)),
output_proj_dims=1536,
output_dim=2048,
norm_eps=1e-06,
sscp_norm_eps=1e-06,
dtype=None,
**kwargs
)
Audio encoder for Gemma4 based on the Universal Speech Model (USM).
Encodes mel spectrograms into audio token embeddings projected into the language model's hidden space. The pipeline is:
1. Convolutional subsampling front end (configured by the sscp_* arguments), projecting to hidden_size.
2. Conformer blocks (num_layers of them): macaron-FFW → chunk attention with relative position bias → causal depthwise Conv1D → macaron-FFW → RMS norm.
3. Optional temporal reduction (when reduction_factor > 1): reduce the sequence by taking every reduction_factor-th token.
4. Output projection: linear hidden_size → output_proj_dims, followed by another linear output_proj_dims → output_dim (= text hidden size) and a parameter-free RMS norm.
Padded positions (indicated by audio_mel_mask) are zeroed out in the final output.
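The last stages of the pipeline (temporal reduction, the two linear projections, the parameter-free RMS norm, and zeroing of padded positions) can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the library's implementation: the function names, the random stand-in weights, the toy dimensions, and the mask polarity (True marks a real frame) are all hypothetical.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    """Parameter-free RMS norm over the last axis (no learned scale)."""
    mean_sq = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps)


def finish_encoder(hidden, valid_mask, w_proj, w_out, reduction_factor=1):
    """Hypothetical sketch of the final encoder stages.

    hidden:     (batch, time, hidden_size) output of the block stack.
    valid_mask: (batch, time) bool, True for real frames (assumed polarity).
    w_proj:     (hidden_size, output_proj_dims) stand-in weights.
    w_out:      (output_proj_dims, output_dim) stand-in weights.
    """
    if reduction_factor > 1:
        # Keep every reduction_factor-th token, and the matching mask entries.
        hidden = hidden[:, ::reduction_factor]
        valid_mask = valid_mask[:, ::reduction_factor]
    x = hidden @ w_proj          # hidden_size -> output_proj_dims
    x = x @ w_out                # output_proj_dims -> output_dim
    x = rms_norm(x)
    # Zero out padded positions in the final output.
    return x * valid_mask[..., None]


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 8, 16))   # toy sizes, not the real defaults
valid = np.ones((2, 8), dtype=bool)
valid[1, 5:] = False                   # second example has 3 padded frames
w_proj = rng.normal(size=(16, 24))
w_out = rng.normal(size=(24, 32))

out = finish_encoder(hidden, valid, w_proj, w_out, reduction_factor=2)
print(out.shape)  # (2, 4, 32)
```

With reduction_factor=2 the eight frames become four tokens, and the last token of the second example is all zeros because the frame it was taken from is padding.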
Arguments
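- input_feat_size: Defaults to 128.
- hidden_size: Defaults to 1024.
- num_heads: Defaults to 8.
- num_layers: Defaults to 12.
- chunk_size: Defaults to 12.
- context_left: Defaults to 13.
- context_right: Defaults to 0.
- logit_cap: Defaults to 50.0.
- invalid_logit_value: Defaults to -1e9.
- conv_kernel_size: Defaults to 5.
- reduction_factor: Defaults to 1.
- residual_weight: Defaults to 0.5.
- gradient_clipping: Defaults to 1e10.
- sscp_conv_channels: Defaults to (128, 32).
- sscp_kernel_sizes: (kT, kF) pairs. Defaults to ((3, 3), (3, 3)).
- sscp_stride_sizes: (sT, sF) pairs. Defaults to ((2, 2), (2, 2)).
- output_proj_dims: Intermediate audio projection dimension (e.g. 1536). None skips this projection. Defaults to 1536.
- output_dim: Defaults to 2048.
- norm_eps: Defaults to 1e-6.
- sscp_norm_eps: Defaults to 1e-6.
- dtype: Defaults to None.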
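To make a few of these defaults concrete, the sketch below works out how the two strided convolutions might shrink a spectrogram and how a logit cap of 50.0 might act on attention scores. Several points are assumptions rather than documented behavior: that (sT, sF) means (time stride, frequency stride), that the convolutions use "same"-style padding (so each output length is ceil(n / stride)), and that logit_cap is applied as tanh soft-capping. The helper names are hypothetical.

```python
import math


def conv_out_len(n, stride):
    """Output length of a strided conv, ASSUMING 'same'-style padding."""
    return math.ceil(n / stride)


def sscp_output_shape(num_frames, input_feat_size=128,
                      sscp_conv_channels=(128, 32),
                      sscp_stride_sizes=((2, 2), (2, 2))):
    """Return (time, freq, channels) after the strided conv stack.

    Assumes each stride pair is (time stride, frequency stride).
    """
    t, f = num_frames, input_feat_size
    for s_t, s_f in sscp_stride_sizes:
        t, f = conv_out_len(t, s_t), conv_out_len(f, s_f)
    return t, f, sscp_conv_channels[-1]


def soft_cap(logit, logit_cap=50.0):
    """ASSUMED capping rule: smoothly squash a logit into (-cap, cap)."""
    return logit_cap * math.tanh(logit / logit_cap)


t, f, c = sscp_output_shape(num_frames=100)
print(t, f, c)                      # 25 32 32
print(round(soft_cap(1000.0), 3))   # 50.0
```

Under these assumptions the default strides shorten the time axis by roughly a factor of four, while small logits pass through the cap almost unchanged and large ones saturate near 50.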