Attention class

```python
keras.layers.Attention(
    use_scale=False, score_mode="dot", dropout=0.0, seed=None, **kwargs
)
```
Dot-product attention layer, a.k.a. Luong-style attention.
Inputs are a list with 2 or 3 elements:
1. A query tensor of shape (batch_size, Tq, dim).
2. A value tensor of shape (batch_size, Tv, dim).
3. An optional key tensor of shape (batch_size, Tv, dim). If none is
supplied, value will be used as the key.
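A minimal usage sketch of the two-element input case (the shapes below, batch_size=4, Tq=8, Tv=10, dim=16, are arbitrary and only for illustration); with no key supplied, value also serves as the key:

```python
import numpy as np
import keras

# Hypothetical shapes: batch_size=4, Tq=8, Tv=10, dim=16.
query = np.random.random((4, 8, 16)).astype("float32")
value = np.random.random((4, 10, 16)).astype("float32")

attention = keras.layers.Attention()
# Two-element input list: value is reused as the key.
output = attention([query, value])
print(output.shape)  # (4, 8, 16) == (batch_size, Tq, dim)
```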
The calculation follows the steps:
1. Calculate attention scores using query and key with shape
(batch_size, Tq, Tv).
2. Use scores to calculate a softmax distribution with shape
(batch_size, Tq, Tv).
3. Use the softmax distribution to create a linear combination of value
with shape (batch_size, Tq, dim).
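The three steps can be sketched directly with keras.ops. This is an illustration of the default score_mode="dot" case without scaling, masking, or dropout, not the layer's exact implementation; shapes are assumptions for the example.

```python
import numpy as np
from keras import ops

# Hypothetical shapes: batch_size=2, Tq=3, Tv=5, dim=4.
query = np.random.random((2, 3, 4)).astype("float32")
key = np.random.random((2, 5, 4)).astype("float32")
value = np.random.random((2, 5, 4)).astype("float32")

# 1. Attention scores from query and key: (batch_size, Tq, Tv).
scores = ops.matmul(query, ops.transpose(key, axes=(0, 2, 1)))
# 2. Softmax distribution over the Tv positions: (batch_size, Tq, Tv).
distribution = ops.softmax(scores, axis=-1)
# 3. Linear combination of value: (batch_size, Tq, dim).
outputs = ops.matmul(distribution, value)
```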
Arguments

use_scale: If True, will create a scalar variable to scale the
attention scores.
dropout: Float between 0 and 1. Fraction of the units to drop for the
attention scores. Defaults to 0.0.
seed: A Python integer to use as random seed in case of dropout.
score_mode: Function to use to compute attention scores, one of
{"dot", "concat"}. "dot" refers to the dot product between the
query and key vectors. "concat" refers to the hyperbolic tangent
of the concatenation of the query and key vectors.
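For illustration, a constructor call that sets each of these arguments together; the specific values are arbitrary choices, not recommendations:

```python
import keras

attention = keras.layers.Attention(
    use_scale=True,       # learn a scalar multiplier for the attention scores
    score_mode="concat",  # tanh of the concatenation of query and key
    dropout=0.1,          # drop 10% of the attention scores during training
    seed=1337,            # random seed used for the dropout
)
```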
Call arguments

query: Query tensor of shape (batch_size, Tq, dim).
value: Value tensor of shape (batch_size, Tv, dim).
key: Optional key tensor of shape (batch_size, Tv, dim). If
not given, will use value for both key and value, which is
the most common case.
query_mask: A boolean mask tensor of shape (batch_size, Tq).
If given, the output will be zero at the positions where
mask==False.
value_mask: A boolean mask tensor of shape (batch_size, Tv).
If given, will apply the mask such that values at positions
where mask==False do not contribute to the result.
return_attention_scores: bool, if True, returns the attention scores
(after masking and softmax) as an additional output argument.
use_causal_mask: Boolean. Set to True for decoder self-attention. Adds
a mask such that position i cannot attend to positions j > i.
This prevents the flow of information from the future towards the
past. Defaults to False.
Output

Attention outputs of shape (batch_size, Tq, dim).
(Optional) Attention scores after masking and softmax with shape
(batch_size, Tq, Tv).
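A sketch of a fuller call using the arguments above. It assumes the masks are passed as a list in [query_mask, value_mask] order through the layer's mask argument; the shapes and data are made up for the example.

```python
import numpy as np
import keras

batch_size, Tq, Tv, dim = 4, 6, 10, 16
query = np.random.random((batch_size, Tq, dim)).astype("float32")
value = np.random.random((batch_size, Tv, dim)).astype("float32")

query_mask = np.ones((batch_size, Tq), dtype=bool)
value_mask = np.ones((batch_size, Tv), dtype=bool)
value_mask[:, 7:] = False  # value positions 7..9 should not contribute

attention = keras.layers.Attention()
outputs, scores = attention(
    [query, value],
    mask=[query_mask, value_mask],  # assumed [query_mask, value_mask] ordering
    return_attention_scores=True,
    # use_causal_mask=True,         # for decoder self-attention
)
print(outputs.shape)  # (4, 6, 16) == (batch_size, Tq, dim)
print(scores.shape)   # (4, 6, 10) == (batch_size, Tq, Tv)
```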