Attention class

keras.layers.Attention(
use_scale=False, score_mode="dot", dropout=0.0, seed=None, **kwargs
)
Dot-product attention layer, a.k.a. Luong-style attention.
Inputs are a list with 2 or 3 elements:

1. A query tensor of shape (batch_size, Tq, dim).
2. A value tensor of shape (batch_size, Tv, dim).
3. An optional key tensor of shape (batch_size, Tv, dim). If none supplied, value will be used as a key.
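For instance, a minimal sketch of calling the layer on a query/value pair (the shapes and random inputs below are illustrative, not part of the API):

import numpy as np
import keras

# Illustrative shapes: batch_size=4, Tq=8, Tv=10, dim=16.
query = np.random.random((4, 8, 16)).astype("float32")
value = np.random.random((4, 10, 16)).astype("float32")

attention = keras.layers.Attention()
output = attention([query, value])  # no key given, so value is used as key
print(output.shape)  # (4, 8, 16)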
The calculation follows the steps:

1. Calculate attention scores using query and key with shape (batch_size, Tq, Tv).
2. Use scores to calculate a softmax distribution with shape (batch_size, Tq, Tv).
3. Use the softmax distribution to create a linear combination of value with shape (batch_size, Tq, dim).
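A rough NumPy sketch of these three steps, ignoring scaling, masking, and dropout (the helper name and shapes are illustrative, not the layer's actual implementation):

import numpy as np

def dot_product_attention(query, value, key=None):
    # Fall back to value when no key is supplied.
    if key is None:
        key = value
    # Step 1: scores of shape (batch_size, Tq, Tv) from query and key.
    scores = np.einsum("bqd,bvd->bqv", query, key)
    # Step 2: softmax over the Tv axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 3: linear combination of value, shape (batch_size, Tq, dim).
    return np.einsum("bqv,bvd->bqd", weights, value)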
Arguments

- use_scale: If True, will create a scalar variable to scale the attention scores.
- dropout: Float between 0 and 1. Fraction of the units to drop for the attention scores. Defaults to 0.0.
- seed: A Python integer to use as random seed in case of dropout.
- score_mode: Function to use to compute attention scores, one of {"dot", "concat"}. "dot" refers to the dot product between the query and key vectors. "concat" refers to the hyperbolic tangent of the concatenation of the query and key vectors.
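For example, a non-default configuration might look like the following sketch (the argument values are illustrative only):

import keras

# Scaled "concat"-style scoring with dropout on the attention weights.
attention = keras.layers.Attention(
    use_scale=True,
    score_mode="concat",
    dropout=0.1,
    seed=42,
)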
Call arguments
inputs: List of the following tensors:
  - query: Query tensor of shape (batch_size, Tq, dim).
  - value: Value tensor of shape (batch_size, Tv, dim).
  - key: Optional key tensor of shape (batch_size, Tv, dim). If not given, will use value for both key and value, which is the most common case.
mask: List of the following tensors:
  - query_mask: A boolean mask tensor of shape (batch_size, Tq). If given, the output will be zero at the positions where mask==False.
  - value_mask: A boolean mask tensor of shape (batch_size, Tv). If given, will apply the mask such that values at positions where mask==False do not contribute to the result.
return_attention_scores: bool, if True, returns the attention scores (after masking and softmax) as an additional output argument.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (no dropout).
use_causal_mask: Boolean. Set to True for decoder self-attention. Adds a mask such that position i cannot attend to positions j > i. This prevents the flow of information from the future towards the past. Defaults to False.
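A hedged sketch of a call with masks, dropout-enabled training, and causal self-attention (all shapes and values below are illustrative):

import numpy as np
import keras

query = np.random.random((4, 8, 16)).astype("float32")   # (batch_size, Tq, dim)
value = np.random.random((4, 10, 16)).astype("float32")  # (batch_size, Tv, dim)
query_mask = np.ones((4, 8), dtype=bool)   # (batch_size, Tq)
value_mask = np.ones((4, 10), dtype=bool)  # (batch_size, Tv)

attention = keras.layers.Attention(dropout=0.1)
output = attention(
    [query, value],
    mask=[query_mask, value_mask],
    training=True,  # enables dropout on the attention weights
)

# Decoder-style self-attention: position i cannot attend to positions j > i.
x = np.random.random((4, 8, 16)).astype("float32")
self_attn_out = attention([x, x], use_causal_mask=True)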
Output:

Attention outputs of shape (batch_size, Tq, dim).

(Optional) Attention scores after masking and softmax with shape (batch_size, Tq, Tv).
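For example, requesting the optional scores output (illustrative shapes, assuming default arguments):

import numpy as np
import keras

query = np.random.random((4, 8, 16)).astype("float32")   # (batch_size, Tq, dim)
value = np.random.random((4, 10, 16)).astype("float32")  # (batch_size, Tv, dim)

attention = keras.layers.Attention()
output, scores = attention([query, value], return_attention_scores=True)
print(output.shape)  # (4, 8, 16)  -> (batch_size, Tq, dim)
print(scores.shape)  # (4, 8, 10)  -> (batch_size, Tq, Tv)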