
Long short-term memory (LSTM) networks

Lin Yu

2022-08-29

1 / 26

An example

Objective: investigate the relationship between drug dosage and drug efficacy

2 / 26


Neural Networks

(Main idea)

3 / 26

Terminology Alert🤯!!!

  • Input layer
  • Hidden layer(s)
  • Output layer
  • Node/Neuron
  • Weights and biases
  • Activation function
4 / 26

Activation functions at a glance

5 / 26

Main idea of NN

flip, twist, and/or stretch the activation functions

6 / 26

Update parameters

That is, to minimize the loss function

  • initialize weights from N(0, 1) and set biases to zero
  • forward propagation: calculate the predicted values
  • backpropagation and the chain rule

SSR = \sum_{i=1}^{n} (\mathrm{Observed}_i - \mathrm{Predicted}_i)^2

\mathrm{Predicted} \approx \sigma_1 w_3 + \sigma_2 w_4, where \sigma(z) = \log(1 + e^z)

z = w_1 x

derivative = \frac{dSSR}{dw_1} = \frac{dSSR}{d\,\mathrm{pred}} \cdot \frac{d\,\mathrm{pred}}{d\sigma} \cdot \frac{d\sigma}{dz} \cdot \frac{dz}{dw_1}

new w_1 = old w_1 - \alpha \cdot derivative
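
To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on w_1 for this toy softplus network; the dosage/efficacy values, the second hidden node's activations, and the learning rate are all made-up numbers for illustration, not values from the slides:

import numpy as np

# made-up data: dosage (x) and observed efficacy
x = np.array([0.0, 0.5, 1.0])
observed = np.array([0.0, 1.0, 0.0])

# parameters of the small network (arbitrary starting values);
# w1 feeds hidden node 1, w3/w4 scale the two hidden activations
w1, b1 = 1.0, 0.0
w3, w4 = -1.0, 1.0
sigma2 = np.array([0.2, 0.5, 0.9])   # activations of hidden node 2, held fixed here

def softplus(z):
    return np.log(1.0 + np.exp(z))       # sigma(z) = log(1 + e^z)

def d_softplus(z):
    return 1.0 / (1.0 + np.exp(-z))      # dsigma/dz (the logistic function)

alpha = 0.1   # learning rate
for step in range(100):
    # forward propagation
    z = w1 * x + b1
    sigma1 = softplus(z)
    predicted = sigma1 * w3 + sigma2 * w4

    # backpropagation via the chain rule:
    # dSSR/dw1 = dSSR/dpred * dpred/dsigma1 * dsigma1/dz * dz/dw1
    dSSR_dpred = -2.0 * (observed - predicted)
    gradient = np.sum(dSSR_dpred * w3 * d_softplus(z) * x)

    # new w1 = old w1 - alpha * derivative
    w1 = w1 - alpha * gradient

# final forward pass with the learned w1
predicted = softplus(w1 * x + b1) * w3 + sigma2 * w4
print("learned w1:", w1, " SSR:", np.sum((observed - predicted) ** 2))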

7 / 26

The more common framework

8 / 26


However...

traditional NNs

only accept fixed-length input

that is,

cannot be applied to sequential (e.g., audio/sentence) or time-series data

9 / 26

Scenarios

  • Example 1: use Amazon review to predict customer behavior (buying/not buying)

Amazing, this box of cereal gave me a perfectly balanced breakfast, as all things should be. I only ate half of it but will definitely be buying again!

  • Example 2: stock price prediction

    • company 1: Day 1, Day 2, Day 3, Day 4, ..., Day n;

    • company 2: Day 1001, Day 1002, Day 1003, Day 1004, ..., Day 100...

We see:

  • Different amounts of input

  • Predictions are based on historical information (correlation)

So, a traditional NN does not work in these scenarios

10 / 26

Here comes the RNN to help!

11 / 26

Recurrent Neural Networks

(Main idea)

12 / 26

Feedback loops

Information from earlier steps is taken in and passed on to the next cell

13 / 26


An example

Given yesterday's and today's stock prices (i.e., time series data), denoted x_{t-1} and x_t, we want to predict tomorrow's stock price. A little bit of formulation...

h_{t-1} = \tanh(w_1 x_{t-1} + b_1)

h_t = \tanh(w_2 h_{t-1} + w_1 x_t + b_1)

(by the way, h_t is called the hidden state)
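
A minimal NumPy sketch of unrolling this formula over a short price sequence; the prices and the parameter values w1, w2, b1 below are made-up numbers for illustration:

import numpy as np

# made-up sequence of (scaled) stock prices: x_{t-1}, x_t, ...
prices = np.array([0.3, 0.5, 0.4, 0.7])

# shared parameters, reused at every time step (arbitrary values)
w1, w2, b1 = 0.8, 0.6, 0.1

h = 0.0                          # initial hidden state
for x_t in prices:
    # h_t = tanh(w2 * h_{t-1} + w1 * x_t + b1)
    h = np.tanh(w2 * h + w1 * x_t + b1)
    print(round(h, 4))

# the final hidden state summarizes the whole sequence and would be fed
# to an output layer to predict tomorrow's price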

14 / 26

Properties of RNNs

h_{t-1} = \tanh(w_1 x_{t-1} + b_1)

h_t = \tanh(w_2 h_{t-1} + w_1 x_t + b_1)

RNNs have the following properties:

  • Weights and biases are shared across every time step

  • The exploding/vanishing gradient problem: the more we unroll an RNN, the harder it is to train

  • Not capable of handling long-term dependencies; they only retain short-term memory (shown by Hochreiter (1991) and Bengio et al. (1994))

(Reminder: skip the following two slides if not interested)

15 / 26

The exploding/vanishing gradient problem

Take the stock price prediction task and suppose we had 100 sequential days of stock price data

The gradient/step size: \frac{dSSR}{dw_1} = \frac{dSSR}{d\,\mathrm{pred}} \cdot \frac{d\,\mathrm{pred}}{dh} \cdot \frac{dh}{dz} \cdot \frac{dz}{dw_1}, where z = w_1 x_1 + b_1

h_1 = \tanh(w_1 x_1 + b_1)

h_2 = \tanh(w_2 h_1 + w_1 x_2 + b_1)

h_3 = \tanh(w_2 h_2 + w_1 x_3 + b_1)

simplified: h_3 \approx f[w_2 f[w_2 f(w_1 x_1)]]

... ...

h_{100} \approx f[w_2 f[w_2 \cdots f[w_2 f(w_1 x_1)] \cdots]] \approx f \circ f \circ \cdots \circ f(w_2^{99} w_1 x_1)

continued on the next slide

16 / 26

The gradient: \frac{dSSR}{dw_1} = \frac{dSSR}{d\,\mathrm{pred}} \cdot \frac{d\,\mathrm{pred}}{dh} \cdot \frac{dh}{dz} \cdot \frac{dz}{dw_1}, where z = w_1 x_1 + b_1

h_{100} \approx f[w_2 f[w_2 \cdots f[w_2 f(w_1 x_1)] \cdots]] \approx f \circ f \circ \cdots \circ f(w_2^{99} w_1 x_1)

\frac{dz}{dw_1} \approx x_1 w_2^{99}, so the gradient looks like \frac{dSSR}{dw_1} = \text{something} \cdot w_2^{99}

new w = old w - \alpha \cdot derivative

If |w_2| > 1, the gradient explodes; if |w_2| < 1, the gradient vanishes
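
A quick numeric check of that w_2^{99} factor (the two w_2 values below are arbitrary examples):

# the gradient contains a factor of roughly w2^99 after unrolling 100 steps
for w2 in (1.1, 0.9):
    print(f"w2 = {w2}: w2^99 = {w2 ** 99:.3e}")

# w2 = 1.1: w2^99 = 1.253e+04   -> gradient explodes
# w2 = 0.9: w2^99 = 2.951e-05   -> gradient vanishes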

17 / 26

After a thousand calls and urgings, it finally appears... LSTM💣

18 / 26

The GATES

  • A way to optionally let information through, i.e., to remove information from or add information to the cell state;
  • Composition: a sigmoid layer + a point-wise multiplication operation;
  • The sigmoid output lies in (0, 1), describing how much of each component should be let through (see the sketch below).
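
A minimal NumPy sketch of a single gate, with made-up sizes and weights: a sigmoid layer produces values in (0, 1), which are then multiplied point-wise with the information being gated.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up previous hidden state, current input, and gate parameters
h_prev = np.array([0.2, -0.5])
x_t = np.array([0.7])
W = np.random.randn(2, 3)     # gate weights acting on [h_prev, x_t]
b = np.zeros(2)

# sigmoid layer: how much of each component to let through (values in (0, 1))
gate = sigmoid(W @ np.concatenate([h_prev, x_t]) + b)

# point-wise multiplication with the information being gated
information = np.array([1.5, -2.0])
print(gate * information)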

19 / 26

Forget gate

Decides what information will be removed/thrown away from the cell state

Previous example: Amazing, this box of cereal gave me a perfectly balanced breakfast, as all things should be. I only ate half of it but will definitely be buying again!

20 / 26


Input gates

  • part 1: the input (layer) gate decides which values will be updated
  • part 2: the input modulation gate creates a new vector of candidate values

The input modulation gate (g) is often considered a sub-part of the input gate; much of the LSTM literature does not even mention it and assumes it sits inside the input gate. It modulates the information that the input gate will write onto the internal cell state by adding non-linearity and making the information zero-mean, which reduces learning time because zero-mean input converges faster. Although this gate's actions are less important than the others' and it is often treated as a finesse-providing concept, it is good practice to include it in the structure of the LSTM unit.
21 / 26

Update the old cell state

  • part 1: multiply the old cell state by the forget gate
  • part 2: add the input gate's output to keep the updated information

22 / 26

Output gate

Output a filtered version of the cell state (with some conditions)

  • part 1: a sigmoid layer decides what parts of the cell state we're going to output
  • part 2: push the cell state through a tanh and multiply it by the output gate (see the sketch below)
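
Putting the forget, input (and modulation), and output gates together, here is a minimal NumPy sketch of one LSTM cell; the dimensions, random weights, and input sequence are made-up for illustration, and a real implementation (e.g., keras.layers.LSTM on the next slides) handles all of this internally:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
concat = n_hidden + n_input                # gates act on [h_prev, x_t]

# one weight matrix and bias per gate (random values, for illustration only)
Wf, Wi, Wg, Wo = (rng.normal(size=(n_hidden, concat)) for _ in range(4))
bf = bi = bg = bo = np.zeros(n_hidden)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)               # forget gate: what to drop from c_prev
    i = sigmoid(Wi @ z + bi)               # input gate: which values to update
    g = np.tanh(Wg @ z + bg)               # input modulation gate: candidate values
    c = f * c_prev + i * g                 # update the old cell state
    o = sigmoid(Wo @ z + bo)               # output gate: what parts of c to expose
    h = o * np.tanh(c)                     # new hidden state
    return h, c

h = c = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):  # made-up sequence of 5 time steps
    h, c = lstm_step(x_t, h, c)
print(h)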

23 / 26

End of this presentation, thanks!

24 / 26

An example with python code

  • Import necessary libs
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Dropout
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
import seaborn as sns
  • Import data
df = pd.read_csv("data/GE.csv")
## keep the dates separately (useful for plotting later)
train_dates = pd.to_datetime(df['Date'])
## use the Open, High, Low, Close, and Adj Close columns as training features
cols = list(df)[1:6]
df_for_training = df[cols].astype(float)
## scale the data
scaler = StandardScaler()
scaler = scaler.fit(df_for_training)
df_for_training_scaled = scaler.transform(df_for_training)
25 / 26
## preview first five rows
df_for_training.head(5)
## Open High Low Close Adj Close
## 0 101.290001 103.510002 101.059998 103.269997 102.884109
## 1 103.360001 105.129997 102.550003 104.699997 104.308762
## 2 104.459999 104.620003 102.839996 103.379997 102.993690
## 3 103.900002 106.150002 103.900002 106.089996 105.693565
## 4 106.330002 106.459999 104.800003 105.190002 104.796944
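
As a possible continuation of the example, a hedged sketch of turning the scaled data into overlapping sequences and fitting a small LSTM; the window length n_past = 14, the layer sizes, and the training settings are arbitrary illustrative choices, not part of the original data-prep code above.

## build overlapping input sequences: n_past days of features -> next day's Open
n_past, n_future = 14, 1
trainX, trainY = [], []
for i in range(n_past, len(df_for_training_scaled) - n_future + 1):
    trainX.append(df_for_training_scaled[i - n_past:i, :])
    trainY.append(df_for_training_scaled[i + n_future - 1, 0])
trainX, trainY = np.array(trainX), np.array(trainY)

## a small stacked-LSTM model (layer sizes and epochs are illustrative)
model = Sequential()
model.add(LSTM(64, return_sequences=True,
               input_shape=(trainX.shape[1], trainX.shape[2])))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(trainX, trainY, epochs=5, batch_size=16, validation_split=0.1)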
26 / 26
