# Sequence Models - Recurrent Neural Networks

Examples of sequence data:

• Speech recognition
• Music generation
• Sentiment classification
• DNA sequence analysis
• Machine translation
• Video activity recognition
• Named entity recognition

# Recurrent Neural Network Model

## Why not a standard network?

• Inputs and outputs can have different lengths in different examples.
• It doesn't share features learned across different positions of the text.

## Weakness of RNN

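A unidirectional RNN uses only information from earlier in the sequence when predicting at time $$t$$, not information from later in the sequence.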

## Forward Propagation

\begin{aligned} a^{<t>} &= g_1(w_{aa} a^{<t - 1>} + w_{ax} x^{<t>} + b_a)\\ &= g_1(w_a \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_a), \quad w_a = \begin{bmatrix} w_{aa} & w_{ax} \end{bmatrix}\\ \hat y^{<t>} &= g_2(w_{ya} a^{<t>} + b_y) \end{aligned}

• The activation $$g_1$$ is usually $$\tanh$$ in an RNN.
• $$g_2$$ depends on the type of output:
  • binary classification problem: $$\text{sigmoid}$$
  • k-way classification problem: $$\text{softmax}$$
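A minimal NumPy sketch of one forward step, assuming a softmax output. The function name `rnn_step`, the sizes, and the random weights are illustrative assumptions, not part of the course code. It also checks that the stacked form gives the same hidden state as the two-matrix form.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the first axis."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN time step: tanh for the hidden state, softmax for the output."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = softmax(W_ya @ a_t + b_y)
    return a_t, y_hat_t

# Hypothetical sizes: 3 input features, 4 hidden units, 2 output classes.
rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 4, 2
W_aa, W_ax = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_y, n_a))
b_a, b_y = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a_prev, x_t = np.zeros((n_a, 1)), rng.normal(size=(n_x, 1))
a_t, y_hat_t = rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y)

# Stacked form: W_a = [W_aa | W_ax] applied to [a_prev; x_t].
W_a = np.hstack([W_aa, W_ax])
a_stacked = np.tanh(W_a @ np.vstack([a_prev, x_t]) + b_a)
```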

## Back Propagation

\begin{aligned} \mathcal L^{<t>} (\hat y^{<t>}, y^{<t>}) &= -y^{<t>} \log \hat y^{<t>} - (1 - y^{<t>}) \log (1 - \hat y^{<t>})\\ \mathcal L(\hat y, y) &= \sum_{t = 1}^{T_y} \mathcal L^{<t>} (\hat y^{<t>}, y^{<t>}) \end{aligned}
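As a small sketch of this loss, assuming binary labels at every time step and made-up predictions (the name `sequence_loss` and the clipping constant are illustrative choices):

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Sum of per-step binary cross-entropy losses over the sequence."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return per_step.sum()

# Toy sequence of three time steps.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
loss = sequence_loss(y_hat, y)
```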

## Different Types of RNN

• many-to-one architecture:

Sentiment Classification

• one-to-many architecture

Music Generation

• many-to-many architecture:

Machine translation: input and output can have different lengths (encoder, decoder).

## Language Model and Sequence Generation

• Language modelling

Give the probability of a sentence: $$P(\text{sentence}) = ?$$

Basic job: estimate the probability of a sequence $$P(y^{<1>}, \dots, y^{<T_y>})$$

• Training set: a large corpus of English text.

• Add $$\text{<EOS>}$$ at the end of each sentence.
• Replace unknown words with $$\text{<UNK>}$$.
• Training with an RNN model

At each step, feed the previous true word in as the input: $$x^{<i>} = y^{<i - 1>}$$.

$P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>}) P(y^{<2>} | y^{<1>}) P(y^{<3>} | y^{<1>}, y^{<2>})$
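To make the factorization concrete, here is a sketch with made-up conditional probabilities (the numbers and the name `sentence_probability` are assumptions for illustration). Summing log-probabilities instead of multiplying directly avoids underflow on long sentences.

```python
import math

def sentence_probability(step_probs):
    """Multiply per-step conditional probabilities via a sum of logs."""
    return math.exp(sum(math.log(p) for p in step_probs))

# P(y1), P(y2 | y1), P(y3 | y1, y2), as a trained model might output them.
step_probs = [0.1, 0.5, 0.2]
p_sentence = sentence_probability(step_probs)
```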

## Sampling novel sequences

• Sampling a sequence from a trained RNN

Generate the sentence word by word.

• Character-level language model

$$\text{Vocabulary = [a, b, c, \dots]}$$
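A rough sketch of the sampling loop. `next_distribution` is a hypothetical stand-in for one step of a trained RNN (it just produces some valid probability vector), and the start index 0 is an arbitrary choice. The point is the loop itself: sample a token from the predicted distribution, feed it back in as the next input, and stop at `<EOS>` or after a maximum length.

```python
import numpy as np

vocab = ["a", "b", "c", "<EOS>"]

def next_distribution(prev_index, state):
    """Stand-in for a trained RNN step: returns (probabilities, new state)."""
    logits = np.cos(np.arange(len(vocab)) + state + prev_index)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state + 1

def sample_sequence(rng, max_len=20):
    """Sample tokens one at a time, feeding each sample back as the next input."""
    indices, prev, state = [], 0, 0
    for _ in range(max_len):
        probs, state = next_distribution(prev, state)
        prev = int(rng.choice(len(vocab), p=probs))
        indices.append(prev)
        if vocab[prev] == "<EOS>":
            break
    return indices

sampled = sample_sequence(np.random.default_rng(0))
text = "".join(vocab[i] for i in sampled if vocab[i] != "<EOS>")
```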

Basic RNNs are not very good at capturing long-range dependencies.

## Gated Recurrent Unit (GRU)

### GRU (simplified)

$$c = \text{memory cell}$$ and $$c^{<t>} = a^{<t>}$$

\begin{aligned} \tilde{c}^{<t>} &= \tanh (w_c \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_c)\\ \Gamma_u &= \sigma(w_u \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_u) \in [0, 1]\\ c^{<t>} &= \Gamma_u \times \tilde{c}^{<t>} + (1 - \Gamma_u) \times c^{<t - 1>} & (\text{element-wise}) \end{aligned}

$$\tilde{c}^{<t>}$$ is a candidate for replacing $$c^{<t>}$$

Think of $$\Gamma_u$$ as being very close to either $$0$$ or $$1$$ most of the time.

If $$\Gamma_u \approx 0$$, then $$c^{<t>} \approx c^{<t - 1>}$$, so the value of the memory cell is maintained almost exactly, even across many time steps.

• This lets the network learn even very long-range dependencies.
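The memory-keeping behaviour can be checked with a small NumPy sketch of the simplified GRU step. The name `gru_simplified_step`, the sizes, and the weights are assumptions. The update gate is forced close to 0 with a very negative bias purely for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_simplified_step(c_prev, x_t, W_c, b_c, W_u, b_u):
    """One simplified GRU step (no relevance gate)."""
    concat = np.vstack([c_prev, x_t])
    c_tilde = np.tanh(W_c @ concat + b_c)
    gamma_u = sigmoid(W_u @ concat + b_u)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev  # element-wise
    return c_t, gamma_u

rng = np.random.default_rng(0)
n_c, n_x = 3, 2
W_c = rng.normal(size=(n_c, n_c + n_x))
b_c = np.zeros((n_c, 1))
W_u = np.zeros((n_c, n_c + n_x))
b_u = np.full((n_c, 1), -20.0)  # very negative bias -> gate close to 0

c0 = rng.normal(size=(n_c, 1))
c = c0
for _ in range(50):  # 50 time steps with random inputs
    c, gamma_u = gru_simplified_step(c, rng.normal(size=(n_x, 1)), W_c, b_c, W_u, b_u)
```

With the gate this close to 0, `c` after 50 steps is still essentially `c0`.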

### Full GRU

\begin{aligned} \tilde{c}^{<t>} &= \tanh (w_c \begin{bmatrix} \Gamma_r \times c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_c)\\ \Gamma_u &= \sigma(w_u \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_u) \in [0, 1]\\ \Gamma_r &= \sigma(w_r \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_r)\\ c^{<t>} &= \Gamma_u \times \tilde{c}^{<t>} + (1 - \Gamma_u) \times c^{<t - 1>} &(\text{element-wise})\\ a^{<t>} &= c^{<t>} \end{aligned}

$$\Gamma_r$$ is the relevance gate: it tells how relevant $$c^{<t - 1>}$$ is for computing the candidate $$\tilde{c}^{<t>}$$.
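A sketch of the full GRU step following the equations above (the name `gru_step` and all weights are assumptions). For the demonstration the relevance gate is forced towards 0, so the candidate $$\tilde{c}^{<t>}$$ ends up depending on $$x^{<t>}$$ only, whatever the previous memory was.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, p):
    """One full GRU step with update and relevance gates."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])
    c_tilde = np.tanh(p["W_c"] @ np.vstack([gamma_r * c_prev, x_t]) + p["b_c"])
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev  # element-wise
    return c_t, c_tilde

rng = np.random.default_rng(0)
n_c, n_x = 3, 2
p = {
    "W_c": rng.normal(size=(n_c, n_c + n_x)), "b_c": np.zeros((n_c, 1)),
    "W_u": rng.normal(size=(n_c, n_c + n_x)), "b_u": np.zeros((n_c, 1)),
    # Assumption for the demo: push the relevance gate towards 0.
    "W_r": np.zeros((n_c, n_c + n_x)), "b_r": np.full((n_c, 1), -30.0),
}

x_t = rng.normal(size=(n_x, 1))
_, c_tilde_1 = gru_step(np.ones((n_c, 1)), x_t, p)
_, c_tilde_2 = gru_step(-np.ones((n_c, 1)), x_t, p)
```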

## Long Short Term Memory (LSTM)

\begin{aligned} \tilde{c}^{<t>} &= \tanh (w_c \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_c)\\ \Gamma_u &= \sigma(w_u \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_u) \in [0, 1] &(\text{update})\\ \Gamma_f &= \sigma(w_f \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_f) &(\text{forget})\\ \Gamma_o &= \sigma(w_o \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_o) &(\text{output})\\ c^{<t>} &= \Gamma_u \times \tilde{c}^{<t>} + \Gamma_f \times c^{<t - 1>} &(\text{element-wise})\\ a^{<t>} &= \Gamma_o \times \tanh(c^{<t>}) \end{aligned}

Peephole connection (element-wise): the gates also look at $$c^{<t - 1>}$$, and the fifth element of $$c^{<t - 1>}$$ affects only the fifth element of each gate.
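A NumPy sketch of one LSTM step as written in the equations above, without the peephole connection. The name `lstm_step`, the parameter dictionary layout, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step with update, forget and output gates."""
    concat = np.vstack([a_prev, x_t])
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])  # update
    gamma_f = sigmoid(params["W_f"] @ concat + params["b_f"])  # forget
    gamma_o = sigmoid(params["W_o"] @ concat + params["b_o"])  # output
    c_t = gamma_u * c_tilde + gamma_f * c_prev                 # element-wise
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

rng = np.random.default_rng(1)
n_a, n_x = 4, 3
params = {}
for gate in "cufo":
    params[f"W_{gate}"] = rng.normal(size=(n_a, n_a + n_x))
    params[f"b_{gate}"] = np.zeros((n_a, 1))

a0, c0 = np.zeros((n_a, 1)), np.zeros((n_a, 1))
a_t, c_t = lstm_step(a0, c0, rng.normal(size=(n_x, 1)), params)
```

Unlike the GRU, the update and forget gates are separate, so the cell can add new content without being forced to drop the same amount of old content.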

## Bidirectional RNN

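$$\overrightarrow a^{<t>}$$ is the forward activation and $$\overleftarrow a^{<t>}$$ is the backward activation; both are computed during forward propagation.

The network is an acyclic graph.

$\hat y^{<t>} = g(w_y \begin{bmatrix} \overrightarrow a^{<t>}\\ \overleftarrow a^{<t>} \end{bmatrix} + b_y)$

A BRNN with LSTM blocks would be a pretty reasonable first thing to try.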
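A sketch of the bidirectional idea with plain tanh RNN cells (LSTM blocks would slot in the same way). The helper `run_direction` and all weights are illustrative assumptions. The backward pass reads the sequence in reverse, and its states are then re-aligned to time order before being stacked with the forward states.

```python
import numpy as np

def run_direction(xs, W_aa, W_ax, b_a):
    """Run a tanh RNN over xs in the given order; return all hidden states."""
    a = np.zeros((W_aa.shape[0], 1))
    states = []
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        states.append(a)
    return states

rng = np.random.default_rng(2)
n_a, n_x, T = 3, 2, 5
xs = [rng.normal(size=(n_x, 1)) for _ in range(T)]
fwd_params = (rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), np.zeros((n_a, 1)))
bwd_params = (rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), np.zeros((n_a, 1)))

a_fwd = run_direction(xs, *fwd_params)
a_bwd = run_direction(xs[::-1], *bwd_params)[::-1]  # re-align to time order

# Each time step sees both directions: shape (2 * n_a, 1).
a_both = [np.vstack([f, b]) for f, b in zip(a_fwd, a_bwd)]
```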

# Homework: Improvise a Jazz Solo with an LSTM Network

You would like to create a jazz music piece specially for a friend's birthday. However, you don't know how to play any instruments, or how to compose music. Fortunately, you know deep learning and will solve this problem using an LSTM network!

You will train a network to generate novel jazz solos in a style representative of a body of performed work. 😎🎷

There's something coming into me when I saw it... Aye...

## Exercise 1 - djmodel

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.layers import Dense, Input, LSTM, RepeatVector, Reshape
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras.utils import to_categorical

    n_a = 64  # hidden state size; assumed value, the notebook defines n_a before this cell
    n_values = 90  # number of music values
    reshaper = Reshape((1, n_values))                  # Used in Step 2.B of djmodel(), below
    LSTM_cell = LSTM(n_a, return_state=True)           # Used in Step 2.C
    densor = Dense(n_values, activation='softmax')     # Used in Step 2.D

    # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

    def djmodel(Tx, LSTM_cell, densor, reshaper):
        """
        Implement the djmodel composed of Tx LSTM cells where each cell is responsible
        for learning the following note based on the previous note and context.
        Each cell has the following schema:
                [X_{t}, a_{t-1}, c_{t-1}] -> RESHAPE() -> LSTM() -> DENSE()
        Arguments:
            Tx -- length of the sequences in the corpus
            LSTM_cell -- LSTM layer instance
            densor -- Dense layer instance
            reshaper -- Reshape layer instance

        Returns:
            model -- a keras instance model with inputs [X, a0, c0]
        """
        # Get the shape of input values
        n_values = densor.units

        # Get the number of the hidden state vector
        n_a = LSTM_cell.units

        # Define the input layer and specify the shape
        X = Input(shape=(Tx, n_values))

        # Define the initial hidden state a0 and initial cell state c0
        # using Input
        a0 = Input(shape=(n_a,), name='a0')
        c0 = Input(shape=(n_a,), name='c0')
        a = a0
        c = c0
        ### START CODE HERE ###
        # Step 1: Create empty list to append the outputs while you iterate (≈1 line)
        outputs = []

        # Step 2: Loop over Tx
        for t in range(Tx):

            # Step 2.A: select the "t"th time step vector from X.
            x = X[:, t, :]
            # Step 2.B: Use reshaper to reshape x to be (1, n_values) (≈1 line)
            x = reshaper(x)
            # Step 2.C: Perform one step of the LSTM_cell
            # (the first returned value is the last output, which equals the hidden state)
            a, _, c = LSTM_cell(x, initial_state=[a, c])
            # Step 2.D: Apply densor to the hidden state output of LSTM_cell
            out = densor(a)
            # Step 2.E: add the output to "outputs"
            outputs.append(out)

        # Step 3: Create model instance
        model = Model(inputs=[X, a0, c0], outputs=outputs)

        ### END CODE HERE ###

        return model


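We will use:

• Optimizer: Adam
• Loss function: categorical cross-entropy (for multi-class classification)

    # `lr` and `decay` follow the older tf.keras API used by the assignment;
    # newer Keras versions name the first argument `learning_rate`.
    opt = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)

    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

    # One target array per time step, hence list(Y)
    history = model.fit([X, a0, c0], list(Y), epochs=100, verbose=0)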


## Exercise 2 - music_inference_model

    # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

    def music_inference_model(LSTM_cell, densor, Ty=100):
        """
        Uses the trained "LSTM_cell" and "densor" from model() to generate a sequence of values.

        Arguments:
            LSTM_cell -- the trained "LSTM_cell" from model(), Keras layer object
            densor -- the trained "densor" from model(), Keras layer object
            Ty -- integer, number of time steps to generate

        Returns:
            inference_model -- Keras model instance
        """

        # Get the shape of input values
        n_values = densor.units
        # Get the number of the hidden state vector
        n_a = LSTM_cell.units

        # Define the input of your model with a shape
        x0 = Input(shape=(1, n_values))

        # Define a0 and c0, the initial hidden state and cell state of the LSTM
        a0 = Input(shape=(n_a,), name='a0')
        c0 = Input(shape=(n_a,), name='c0')
        a = a0
        c = c0
        x = x0

        ### START CODE HERE ###
        # Step 1: Create an empty list of "outputs" to later store your predicted values (≈1 line)
        outputs = []

        # Step 2: Loop over Ty and generate a value at every time step
        for t in range(Ty):
            # Step 2.A: Perform one step of LSTM_cell. Use "x", not "x0" (≈1 line)
            a, _, c = LSTM_cell(x, initial_state=[a, c])

            # Step 2.B: Apply Dense layer to the hidden state output of the LSTM_cell (≈1 line)
            out = densor(a)
            # Step 2.C: Append the prediction "out" to "outputs". out.shape = (None, 90) (≈1 line)
            outputs.append(out)

            # Step 2.D:
            # Select the next value according to "out",
            # Set "x" to be the one-hot representation of the selected value
            # See instructions above.
            x = tf.math.argmax(out, axis=-1)
            x = tf.one_hot(x, depth=n_values)
            # Step 2.E:
            # Use RepeatVector(1) to convert x into a tensor with shape=(None, 1, 90)
            x = RepeatVector(1)(x)

        # Step 3: Create model instance with the correct "inputs" and "outputs" (≈1 line)
        inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)

        ### END CODE HERE ###

        return inference_model


## Exercise 3 - predict_and_sample

    # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

    def predict_and_sample(inference_model, x_initializer=x_initializer, a_initializer=a_initializer,
                           c_initializer=c_initializer):
        """
        Predicts the next value of values using the inference model.

        Arguments:
            inference_model -- Keras model instance for inference time
            x_initializer -- numpy array of shape (1, 1, 90), one-hot vector initializing the values generation
            a_initializer -- numpy array of shape (1, n_a), initializing the hidden state of the LSTM_cell
            c_initializer -- numpy array of shape (1, n_a), initializing the cell state of the LSTM_cell

        Returns:
            results -- numpy-array of shape (Ty, 90), matrix of one-hot vectors representing the values generated
            indices -- numpy-array of shape (Ty, 1), matrix of indices representing the values generated
        """

        # Number of possible values (last dimension of the one-hot input)
        n_values = x_initializer.shape[-1]

        ### START CODE HERE ###
        # Step 1: Use your inference model to predict an output sequence given x_initializer, a_initializer and c_initializer.
        pred = inference_model.predict([x_initializer, a_initializer, c_initializer])
        # Step 2: Convert "pred" into an np.array() of indices with the maximum probabilities
        indices = np.argmax(pred, axis=-1)
        # Step 3: Convert indices to one-hot vectors, the shape of the results should be (Ty, n_values)
        results = to_categorical(indices, num_classes=n_values)
        ### END CODE HERE ###

        return results, indices
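Steps 2 and 3 can be checked on a toy prediction list with NumPy only. This is a sketch under assumptions: the sizes are made up, `pred` is random data shaped like the list that `inference_model.predict` returns, and `np.eye` indexing is swapped in for `to_categorical` so the example does not need Keras.

```python
import numpy as np

Ty, n_values = 4, 5
rng = np.random.default_rng(3)

# Mimic the model output: a list of Ty arrays, each of shape (1, n_values).
pred = [rng.dirichlet(np.ones(n_values)).reshape(1, n_values) for _ in range(Ty)]

indices = np.argmax(pred, axis=-1)            # shape (Ty, 1)
results = np.eye(n_values)[indices.ravel()]   # shape (Ty, n_values), one-hot rows
```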