keras.preprocessing.sequence.TimeseriesGenerator(data, targets, length, sampling_rate=1, stride=1, start_index=0, end_index=None, shuffle=False, reverse=False, batch_size=128)
Utility class for generating batches of temporal data.
This class takes in a sequence of data-points gathered at equal intervals, along with time series parameters such as stride, length of history, etc., to produce batches for training/validation.
- data: Indexable generator (such as list or Numpy array) containing consecutive data points (timesteps). The data should be at least 2D, and axis 0 is expected to be the time dimension.
- targets: Targets corresponding to timesteps in data. It should have the same length as data.
- length: Length of the output sequences (in number of timesteps).
- sampling_rate: Period between successive individual timesteps within sequences. For rate r, timesteps data[i], data[i-r], ... data[i - length] are used to create a sample sequence.
- stride: Period between successive output sequences. For stride s, consecutive output samples would be centered around data[i], data[i+s], data[i+2*s], etc.
- start_index, end_index: Data points earlier than start_index or later than end_index will not be used in the output sequences. This is useful to reserve part of the data for test or validation.
- shuffle: Whether to shuffle output samples, or instead draw them in chronological order.
- reverse: Boolean: if true, timesteps in each output sample will be in reverse chronological order.
- batch_size: Number of timeseries samples in each batch (except maybe the last one).
Returns a Sequence instance.
from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np

# 50 timesteps, one feature each; targets mirror the data.
data = np.array([[i] for i in range(50)])
targets = np.array([[i] for i in range(50)])

data_gen = TimeseriesGenerator(data, targets,
                               length=10, sampling_rate=2,
                               batch_size=2)
assert len(data_gen) == 20

# Each batch is an (inputs, targets) pair.
batch_0 = data_gen[0]
x, y = batch_0
assert np.array_equal(x,
                      np.array([[[0], [2], [4], [6], [8]],
                                [[1], [3], [5], [7], [9]]]))
assert np.array_equal(y,
                      np.array([[10], [11]]))
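To make the interaction of length, sampling_rate, stride, start_index and end_index concrete, here is a minimal pure-Python sketch of the indexing described above. The helper name sketch_window_indices is hypothetical and the code is not the Keras implementation; it assumes each sample's target is the timestep immediately following its window, which is consistent with the example above.

```python
def sketch_window_indices(n_points, length, sampling_rate=1, stride=1,
                          start_index=0, end_index=None):
    """Return (input_indices, target_index) pairs, one per output sample.

    Hypothetical helper: it only mirrors the indexing described in the
    text and is not the Keras implementation.
    """
    if end_index is None:
        end_index = n_points - 1
    samples = []
    # The first target needs `length` timesteps of history before it.
    for i in range(start_index + length, end_index + 1, stride):
        inputs = list(range(i - length, i, sampling_rate))
        samples.append((inputs, i))
    return samples


samples = sketch_window_indices(50, length=10, sampling_rate=2)
first_inputs, first_target = samples[0]
print(first_inputs, first_target)  # [0, 2, 4, 6, 8] 10
print(len(samples))                # 40 samples, i.e. 20 batches of size 2

# Reserving the tail of the series for validation with start_index/end_index.
train = sketch_window_indices(50, length=10, end_index=29)
val = sketch_window_indices(50, length=10, start_index=30)
print(len(train), len(val))        # 20 10
```

Under these assumptions the two generators never share a data point: the training windows use indices 0 to 29 only, and the validation windows use indices 30 to 49 only.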
pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
Pads sequences to the same length.
This function transforms a list of num_samples sequences (lists of integers) into a 2D Numpy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument if provided, or the length of the longest sequence otherwise.
Sequences that are shorter than num_timesteps are padded with value until they reach that length.
Sequences longer than num_timesteps are truncated so that they fit the desired length.
The position where padding or truncation happens is determined by the arguments padding and truncating, respectively. Pre-padding is the default.
- sequences: List of lists, where each element is a sequence.
- maxlen: Int, maximum length of all sequences.
- dtype: Type of the output sequences.
- padding: String, 'pre' or 'post': pad either before or after each sequence.
- truncating: String, 'pre' or 'post': remove values from sequences longer than maxlen, either at the beginning or at the end of the sequences.
- value: Float, padding value.
- x: Numpy array with shape (len(sequences), maxlen)
- ValueError: In case of invalid values for truncating or padding, or in case of invalid shape for a sequences entry.
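The effect of the padding and truncating arguments can be illustrated with a small pure-Python sketch. The helper sketch_pad is hypothetical: it returns plain lists rather than a Numpy array, ignores dtype, and is only meant to show which end of each sequence is filled or cut under each setting.

```python
def sketch_pad(sequences, maxlen=None, padding='pre', truncating='pre',
               value=0):
    """Pad or truncate each sequence to a common length (illustrative only)."""
    if padding not in ('pre', 'post') or truncating not in ('pre', 'post'):
        raise ValueError('padding and truncating must be "pre" or "post"')
    if maxlen is None:
        # Fall back to the length of the longest sequence.
        maxlen = max((len(s) for s in sequences), default=0)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops values from the beginning, 'post' from the end.
            if truncating == 'pre':
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


seqs = [[1], [2, 3], [4, 5, 6]]
print(sketch_pad(seqs))
# [[0, 0, 1], [0, 2, 3], [4, 5, 6]]
print(sketch_pad(seqs, maxlen=2))
# [[0, 1], [2, 3], [5, 6]]
print(sketch_pad(seqs, maxlen=2, padding='post', truncating='post'))
# [[1, 0], [2, 3], [4, 5]]
```

With the defaults, both the fill values and the dropped values are at the start of each sequence, so the most recent items always end up at the right edge of the output.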
skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1.0, shuffle=True, categorical=False, sampling_table=None, seed=None)
Generates skipgram word pairs.
This function transforms a sequence of word indexes (list of integers) into tuples of words of the form:
- (word, word in the same window), with label 1 (positive samples).
- (word, random word from the vocabulary), with label 0 (negative samples).
Read more about Skipgram in this gnomic paper by Mikolov et al.: Efficient Estimation of Word Representations in Vector Space
- sequence: A word sequence (sentence), encoded as a list of word indices (integers). If using a sampling_table, word indices are expected to match the rank of the words in a reference dataset (e.g. 10 would encode the 10th most frequently occurring token). Note that index 0 is expected to be a non-word and will be skipped.
- vocabulary_size: Int, maximum possible word index + 1
- window_size: Int, size of sampling windows (technically half-window). The window of a word w_i will be [i - window_size, i + window_size+1].
- negative_samples: Float >= 0. 0 for no negative (i.e. random) samples. 1 for same number as positive samples.
- shuffle: Whether to shuffle the word couples before returning them.
- categorical: bool. If False, labels will be integers (e.g. [0, 1, 1 .. ]); if True, labels will be categorical, e.g. [[1,0],[0,1],[0,1] .. ].
- sampling_table: 1D array of size vocabulary_size where the entry i encodes the probability to sample a word of rank i.
- seed: Random seed.
couples, labels: where couples are int pairs and labels are either 0 or 1.
By convention, index 0 in the vocabulary is a non-word and will be skipped.
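The pairing scheme can be sketched in a few lines of pure Python. The helper sketch_skipgrams below is hypothetical and simplified: it omits shuffling, the categorical option and sampling_table, and the way negative samples are drawn (reusing words from the positive couples and pairing them with a random index in [1, vocabulary_size - 1]) is an assumption made for illustration, not a statement about the library's internals.

```python
import random


def sketch_skipgrams(sequence, vocabulary_size, window_size=4,
                     negative_samples=1.0, seed=None):
    """Build (couples, labels) for one sequence (illustrative only)."""
    couples, labels = [], []
    for i, word in enumerate(sequence):
        if word == 0:  # index 0 is a non-word and is skipped
            continue
        start = max(0, i - window_size)
        stop = min(len(sequence), i + window_size + 1)
        for j in range(start, stop):
            if j != i and sequence[j] != 0:
                couples.append([word, sequence[j]])
                labels.append(1)  # positive sample: same window

    rng = random.Random(seed)
    n_negative = int(len(couples) * negative_samples)
    positives = [c[0] for c in couples]
    for k in range(n_negative):
        # Assumed scheme: pair a word from the sequence with a random index.
        couples.append([positives[k % len(positives)],
                        rng.randint(1, vocabulary_size - 1)])
        labels.append(0)  # negative sample: random word
    return couples, labels


pos_couples, pos_labels = sketch_skipgrams(
    [1, 2, 0, 3], vocabulary_size=10, window_size=1, negative_samples=0.0)
print(pos_couples, pos_labels)  # [[1, 2], [2, 1]] [1, 1]

all_couples, all_labels = sketch_skipgrams(
    [1, 2, 0, 3], vocabulary_size=10, window_size=1, seed=0)
print(all_labels)               # [1, 1, 0, 0]
```

In this example word 3 produces no couples because its only neighbour within the window is the non-word 0.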
make_sampling_table(size, sampling_factor=1e-05)
Generates a word rank-based probabilistic sampling table.
Used for generating the sampling_table argument for skipgrams. sampling_table[i] is the probability of sampling the i-th most common word in a dataset (more common words should be sampled less frequently, for balance).
The sampling probabilities are generated according to the sampling distribution used in word2vec:
p(word) = min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor))
We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):
frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
gamma is the Euler-Mascheroni constant.
- size: Int, number of possible words to sample.
- sampling_factor: The sampling factor in the word2vec formula.
A 1D Numpy array of length size where the ith entry is the probability that a word of rank i should be sampled.
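The two formulas above can be combined into a short pure-Python sketch. The helper sketch_sampling_table is hypothetical and returns a plain list; the default sampling_factor of 1e-5 and the treatment of rank 0 (reusing the rank 1 value to avoid log(0)) are assumptions made for illustration.

```python
import math


def sketch_sampling_table(size, sampling_factor=1e-5):
    """Sampling probability indexed by frequency rank (illustrative only)."""
    gamma = 0.5772156649  # Euler-Mascheroni constant
    table = []
    for rank in range(size):
        rank = max(rank, 1)  # assumption: rank 0 (non-word) reuses rank 1
        # Zipf approximation: frequency(rank) ~ 1 / inv_frequency
        inv_frequency = (rank * (math.log(rank) + gamma)
                         + 0.5 - 1.0 / (12.0 * rank))
        # ratio = word_frequency / sampling_factor
        ratio = 1.0 / (sampling_factor * inv_frequency)
        table.append(min(1.0, math.sqrt(ratio) / ratio))
    return table


table = sketch_sampling_table(1000)
print(len(table))
print(table[1] < table[10] < table[999])  # True: rarer words are kept more often
```

Because frequency falls as rank grows, the resulting probabilities never decrease with rank, which matches the stated goal of sampling common words less often.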