listen-attend-spell

The model

The Listen, Attend and Spell model is an encoder-decoder neural network used for the task of automatic speech recognition (from speech to text). The encoder, named the listener, is a pyramidal RNN that converts a speech signal into a higher-level feature representation. The decoder, named the speller, is an RNN that takes these high-level features and outputs a probability distribution over sequences of characters. The model is trained end-to-end.

Listener

The listener is an acoustic encoder that takes as input a spectrogram-like representation x = (x1, ..., xT), where each xi is a time frame of the spectrogram representation. The goal of the listener is to map this input into a high-level feature representation h = (h1, ..., hU), with the key constraint that U < T. Thus, the listener must reduce the number of time steps of the original signal into a more compressed representation h, allowing the attend and spell layer to extract relevant information from a reduced number of time steps.

The listener architecture is constructed by stacking multiple Bidirectional Long Short-Term Memory RNNs (BLSTMs) into a pyramidal structure. The time-step reduction is achieved by concatenating two successive (in time) BLSTM outputs at each layer before feeding them to the next BLSTM layer in the pyramid. Thus, the time resolution is reduced by a factor of 2 at each layer of the pyramid, i.e. a pyramid of 3 BLSTM layers performs a time reduction of 2^3 = 8.
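A minimal sketch of such a pyramidal listener, assuming a PyTorch implementation (the class and parameter names below are illustrative, not the repository's actual API):

```python
import torch
import torch.nn as nn

class pBLSTMLayer(nn.Module):
    """One pyramidal BLSTM layer: halves the time resolution by
    concatenating pairs of successive frames before the BLSTM."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # two concatenated frames -> input size is 2 * input_dim
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):
        batch, time, feat = x.shape
        if time % 2 == 1:               # drop the last frame if T is odd
            x = x[:, :-1, :]
            time -= 1
        # concatenate each pair of successive frames: (B, T/2, 2*feat)
        x = x.contiguous().view(batch, time // 2, feat * 2)
        out, _ = self.blstm(x)
        return out

class Listener(nn.Module):
    """Pyramid of 3 pBLSTM layers -> time reduction of 2**3 = 8."""

    def __init__(self, input_dim, hidden_dim, num_layers=3):
        super().__init__()
        layers, in_dim = [], input_dim
        for _ in range(num_layers):
            layers.append(pBLSTMLayer(in_dim, hidden_dim))
            in_dim = hidden_dim * 2     # bidirectional output size
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        # x: (batch, T, input_dim) spectrogram-like features
        for layer in self.layers:
            x = layer(x)
        return x                        # h: (batch, T / 8, 2 * hidden_dim)
```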

Attend and Spell

The attend and spell is an attention-based LSTM transducer. Thus, at every output step it produces the probability distribution for the next character (over all the possible characters in the dictionary), conditioned on all the characters previously produced in output. This solves an issue of CTC, which assumes that the output labels are conditionally independent of each other. Also, by directly producing characters in output there is no problem with Out-Of-Vocabulary (OOV) words.

The attend and spell architecture can be described as:

c_i = AttentionContext(s_i, h)
s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})
P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)

where i is the current time step, c is the context and s is the RNN state. The context c is computed using an attention mechanism and encapsulates the information of the acoustic signal needed to generate the next character. The attention model is content-based: the contents of the decoder state s_i at each time step i are matched to the contents of all h_u in h. Thus, at each time step we compare the current RNN state s_i with all the acoustic information of the input signal x encoded in h and keep the most relevant parts in the context c_i. On convergence, the network learns to focus on only a few frames of h; c_i can be seen as a continuous bag of weighted features of h.

The RNN network is a multi-layer LSTM network and the CharacterDistribution is an MLP network with a softmax output over all the characters in the dictionary.
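A sketch of the three equations above, again assuming PyTorch; for brevity the decoder RNN is a single LSTMCell rather than the multi-layer LSTM described above, and all module names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionContext(nn.Module):
    """Content-based attention: the decoder state s_i is matched against
    every listener output h_u, and the context c_i is their weighted sum."""

    def __init__(self, state_dim, listener_dim, attn_dim):
        super().__init__()
        self.phi = nn.Linear(state_dim, attn_dim)      # projects s_i
        self.psi = nn.Linear(listener_dim, attn_dim)   # projects each h_u

    def forward(self, s, h):
        # s: (batch, state_dim), h: (batch, U, listener_dim)
        energy = torch.bmm(self.psi(h), self.phi(s).unsqueeze(2)).squeeze(2)
        alpha = F.softmax(energy, dim=1)               # weights over the U frames
        return torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # c_i

class Speller(nn.Module):
    """One decoding step: s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1}) followed by
    P(y_i | x, y_<i) = CharacterDistribution(s_i, c_i)."""

    def __init__(self, num_chars, listener_dim, state_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(num_chars, state_dim)
        self.rnn = nn.LSTMCell(state_dim + listener_dim, state_dim)
        self.attention = AttentionContext(state_dim, listener_dim, attn_dim)
        self.char_dist = nn.Sequential(                # MLP over characters
            nn.Linear(state_dim + listener_dim, state_dim),
            nn.ReLU(),
            nn.Linear(state_dim, num_chars))

    def step(self, y_prev, context_prev, state_prev, h):
        # y_prev: (batch,) index of the previously emitted character
        rnn_in = torch.cat([self.embed(y_prev), context_prev], dim=1)
        s, cell = self.rnn(rnn_in, state_prev)         # s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})
        context = self.attention(s, h)                 # c_i = AttentionContext(s_i, h)
        logits = self.char_dist(torch.cat([s, context], dim=1))
        return F.log_softmax(logits, dim=1), context, (s, cell)
```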

Training

During training, teacher forcing is used. Thus, the network maximizes the log probability:

log P(y_i | x, y*_{<i})

where y*_{<i} are the ground-truth characters preceding time step i. During training, the input of the multi-layer LSTM network in the attend and spell layer is the ground-truth sequence.
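A teacher-forced training step might look like the following sketch (shapes, the sos_id token and the helper names are assumptions for illustration, building on the Listener and Speller sketches above):

```python
import torch
import torch.nn.functional as F

def train_step(listener, speller, optimizer, specs, transcripts, sos_id=0):
    """One teacher-forced training step.

    specs:       (batch, T, feat) spectrogram-like inputs
    transcripts: (batch, L) ground-truth character indices
    """
    optimizer.zero_grad()
    h = listener(specs)                                # (batch, U, listener_dim)

    batch, L = transcripts.shape
    y_prev = torch.full((batch,), sos_id, dtype=torch.long)
    context = torch.zeros(batch, h.size(2))
    state = (torch.zeros(batch, speller.rnn.hidden_size),
             torch.zeros(batch, speller.rnn.hidden_size))

    loss = 0.0
    for i in range(L):
        log_probs, context, state = speller.step(y_prev, context, state, h)
        loss = loss + F.nll_loss(log_probs, transcripts[:, i])
        y_prev = transcripts[:, i]                     # teacher forcing: feed ground truth
    loss = loss / L

    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the ground truth is not available, so y_prev would instead be the character sampled or argmax-picked from the previous step's distribution (e.g. with beam search).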
