
Long short-term memory (LSTM) networks

Lin Yu

2022-08-29

1 / 26

An example

Objective: investigate the relationship between drug dosage and drug efficacy

2 / 26


Neural Networks

(Main idea)

3 / 26

Terminology Alert🤯!!!

  • Input layer
  • Hidden layer(s)
  • Output layer
  • Node/Neuron
  • Weights and biases
  • Activation function
4 / 26

Activation functions at a glance

5 / 26

Main idea of NN

flip, twist, and/or stretch the activation functions

6 / 26

Update parameters

That is, to minimize the loss function

  • initialize weights from N(0, 1) and set biases to zero
  • forward propagation: calculate the predicted values
  • backpropagation and the chain rule

SSR = \sum_{i=1}^{n} (\mathrm{Observed}_i - \mathrm{Predicted}_i)^2

\mathrm{Predicted} \approx \sigma_1 w_3 + \sigma_2 w_4, where \sigma(z) = \log(1 + e^z)

z = w_1 x

derivative = \frac{dSSR}{dw_1} = \frac{dSSR}{d\,\mathrm{pred}} \cdot \frac{d\,\mathrm{pred}}{d\sigma} \cdot \frac{d\sigma}{dz} \cdot \frac{dz}{dw_1}

new w_1 = old w_1 - \alpha \cdot derivative
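
To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on w_1 for this toy softplus network; the dosage/efficacy values, the second hidden node's activations, and the learning rate are all made-up numbers for illustration, not values from the slides:

import numpy as np

# made-up data: dosage (x) and observed efficacy
x = np.array([0.0, 0.5, 1.0])
observed = np.array([0.0, 1.0, 0.0])

# parameters of the small network (arbitrary starting values);
# w1 feeds hidden node 1, w3/w4 scale the two hidden activations
w1, b1 = 1.0, 0.0
w3, w4 = -1.0, 1.0
sigma2 = np.array([0.2, 0.5, 0.9])   # activations of hidden node 2, held fixed here

def softplus(z):
    return np.log(1.0 + np.exp(z))       # sigma(z) = log(1 + e^z)

def d_softplus(z):
    return 1.0 / (1.0 + np.exp(-z))      # dsigma/dz (the logistic function)

alpha = 0.1   # learning rate
for step in range(100):
    # forward propagation
    z = w1 * x + b1
    sigma1 = softplus(z)
    predicted = sigma1 * w3 + sigma2 * w4

    # backpropagation via the chain rule:
    # dSSR/dw1 = dSSR/dpred * dpred/dsigma1 * dsigma1/dz * dz/dw1
    dSSR_dpred = -2.0 * (observed - predicted)
    gradient = np.sum(dSSR_dpred * w3 * d_softplus(z) * x)

    # new w1 = old w1 - alpha * derivative
    w1 = w1 - alpha * gradient

# final forward pass with the learned w1
predicted = softplus(w1 * x + b1) * w3 + sigma2 * w4
print("learned w1:", w1, " SSR:", np.sum((observed - predicted) ** 2))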

7 / 26

The more common framework

8 / 26


However...

traditional NNs

only accept fixed-length input

that is,

cannot be applied to sequential (e.g., audio/sentence) or time-series data

9 / 26

Scenarios

  • Example 1: use Amazon review to predict customer behavior (buying/not buying)

Amazing, this box of cereal gave me a perfectly balanced breakfast, as all things should be. I only ate half of it but will definitely be buying again!

  • Example 2: stock price prediction

    • company 1: Day 1, Day 2, Day 3, Day 4, ..., Day n;

    • company 2: Day 1001, Day 1002, Day 1003, Day 1004, ..., Day 100...

We see:

  • Different amounts of input

  • Predictions are based on historical information (correlation)

So, a traditional NN does not work in these scenarios

10 / 26

Here comes the RNN to help!

11 / 26

Recurrent Neural Networks

(Main idea)

12 / 26

Feedback loops

Information from earlier steps is taken in and passed on to the next cell

13 / 26


An example

Given yesterday's and today's stock prices (i.e., time series data), denoted x_{t-1} and x_t, we want to predict tomorrow's stock price. A little bit of formulation...

h_{t-1} = \tanh(w_1 x_{t-1} + b_1)

h_t = \tanh(w_2 h_{t-1} + w_1 x_t + b_1)

(by the way, h_t is called the hidden state)
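
A minimal NumPy sketch of unrolling this formula over a short price sequence; the prices and the parameter values w1, w2, b1 below are made-up numbers for illustration:

import numpy as np

# made-up sequence of (scaled) stock prices: x_{t-1}, x_t, ...
prices = np.array([0.3, 0.5, 0.4, 0.7])

# shared parameters, reused at every time step (arbitrary values)
w1, w2, b1 = 0.8, 0.6, 0.1

h = 0.0                          # initial hidden state
for x_t in prices:
    # h_t = tanh(w2 * h_{t-1} + w1 * x_t + b1)
    h = np.tanh(w2 * h + w1 * x_t + b1)
    print(round(h, 4))

# the final hidden state summarizes the whole sequence and would be fed
# to an output layer to predict tomorrow's price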

14 / 26

Properties of RNNs

h_{t-1} = \tanh(w_1 x_{t-1} + b_1)

h_t = \tanh(w_2 h_{t-1} + w_1 x_t + b_1)

RNNs have the following properties:

  • Weights and biases are shared across every time step

  • The exploding/vanishing gradient problem: the more we unroll an RNN, the harder it is to train

  • Not capable of handling long-term dependencies; they only retain short-term memory (shown by Hochreiter (1991) and Bengio et al. (1994))

(Reminder: skip the following two slides if not interested)

15 / 26

The exploding/vanishing gradient problem

Take the stock price prediction task and suppose we had 100 sequential days of stock price data

The gradient/step size: \frac{dSSR}{dw_1} = \frac{dSSR}{d\,\mathrm{pred}} \cdot \frac{d\,\mathrm{pred}}{dh} \cdot \frac{dh}{dz} \cdot \frac{dz}{dw_1}, where z = w_1 x_1 + b_1

h_1 = \tanh(w_1 x_1 + b_1)

h_2 = \tanh(w_2 h_1 + w_1 x_2 + b_1)

h_3 = \tanh(w_2 h_2 + w_1 x_3 + b_1)

simplified: h_3 \approx f[w_2 f[w_2 f(w_1 x_1)]]

... ...

h_{100} \approx f[w_2 f[w_2 \cdots f[w_2 f(w_1 x_1)] \cdots]] \approx f \circ f \circ \cdots \circ f(w_2^{99} w_1 x_1)

continued on the next slide

16 / 26

The gradient: \frac{dSSR}{dw_1} = \frac{dSSR}{d\,\mathrm{pred}} \cdot \frac{d\,\mathrm{pred}}{dh} \cdot \frac{dh}{dz} \cdot \frac{dz}{dw_1}, where z = w_1 x_1 + b_1

h_{100} \approx f[w_2 f[w_2 \cdots f[w_2 f(w_1 x_1)] \cdots]] \approx f \circ f \circ \cdots \circ f(w_2^{99} w_1 x_1)

\frac{dz}{dw_1} \approx x_1 w_2^{99}, so the gradient looks like \frac{dSSR}{dw_1} = \text{something} \cdot w_2^{99}

new w = old w - \alpha \cdot derivative

If |w_2| > 1, the gradient explodes; if |w_2| < 1, the gradient vanishes
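
A quick numeric check of that w_2^{99} factor (the two w_2 values below are arbitrary examples):

# the gradient contains a factor of roughly w2^99 after unrolling 100 steps
for w2 in (1.1, 0.9):
    print(f"w2 = {w2}: w2^99 = {w2 ** 99:.3e}")

# w2 = 1.1: w2^99 = 1.253e+04   -> gradient explodes
# w2 = 0.9: w2^99 = 2.951e-05   -> gradient vanishes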

17 / 26

After a thousand calls and urgings, it finally appears... LSTM💣

18 / 26

The GATES

  • A way to optionally let information through, i.e., to remove information from or add information to the cell state;
  • Composition: a sigmoid layer + a point-wise multiplication operation;
  • The sigmoid output lies in (0, 1), describing how much of each component should be let through (see the sketch below).
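
A minimal NumPy sketch of a single gate, with made-up sizes and weights: a sigmoid layer produces values in (0, 1), which are then multiplied point-wise with the information being gated.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up previous hidden state, current input, and gate parameters
h_prev = np.array([0.2, -0.5])
x_t = np.array([0.7])
W = np.random.randn(2, 3)     # gate weights acting on [h_prev, x_t]
b = np.zeros(2)

# sigmoid layer: how much of each component to let through (values in (0, 1))
gate = sigmoid(W @ np.concatenate([h_prev, x_t]) + b)

# point-wise multiplication with the information being gated
information = np.array([1.5, -2.0])
print(gate * information)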

19 / 26

Forget gate

Decides what information will be removed/thrown away from the cell state

Previous example: Amazing, this box of cereal gave me a perfectly balanced breakfast, as all things should be. I only ate half of it but will definitely be buying again!

20 / 26


Input gates

  • part 1: the input (layer) gate decides which values will be updated
  • part 2: the input modulation gate creates a new vector of candidate values

The input modulation gate (g) is often considered a sub-part of the input gate; much of the LSTM literature does not even mention it and assumes it sits inside the input gate. It modulates the information that the input gate will write onto the internal cell state by adding non-linearity and making the information zero-mean, which reduces learning time because zero-mean input converges faster. Although this gate's actions are less important than the others' and it is often treated as a finesse-providing concept, it is good practice to include it in the structure of the LSTM unit.
21 / 26

Update the old cell state

  • part 1: multiply the old cell state by the forget gate
  • part 2: add the input gate's output to keep the updated information

22 / 26

Output gate

Output a filtered version of the cell state (with some conditions)

  • part 1: a sigmoid layer decides what parts of the cell state we're going to output
  • part 2: push the cell state through a tanh and multiply it by the output gate (see the sketch below)
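
Putting the forget, input (and modulation), and output gates together, here is a minimal NumPy sketch of one LSTM cell; the dimensions, random weights, and input sequence are made-up for illustration, and a real implementation (e.g., keras.layers.LSTM on the next slides) handles all of this internally:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
concat = n_hidden + n_input                # gates act on [h_prev, x_t]

# one weight matrix and bias per gate (random values, for illustration only)
Wf, Wi, Wg, Wo = (rng.normal(size=(n_hidden, concat)) for _ in range(4))
bf = bi = bg = bo = np.zeros(n_hidden)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)               # forget gate: what to drop from c_prev
    i = sigmoid(Wi @ z + bi)               # input gate: which values to update
    g = np.tanh(Wg @ z + bg)               # input modulation gate: candidate values
    c = f * c_prev + i * g                 # update the old cell state
    o = sigmoid(Wo @ z + bo)               # output gate: what parts of c to expose
    h = o * np.tanh(c)                     # new hidden state
    return h, c

h = c = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):  # made-up sequence of 5 time steps
    h, c = lstm_step(x_t, h, c)
print(h)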

23 / 26

End of this presentation, thanks!

24 / 26

An example with python code

  • Import necessary libs
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Dropout
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
import seaborn as sns
  • Import data
df = pd.read_csv("data/GE.csv")
## keep the dates separately (useful for plotting later)
train_dates = pd.to_datetime(df['Date'])
## use the Open, High, Low, Close, and Adj Close columns as training features
cols = list(df)[1:6]
df_for_training = df[cols].astype(float)
## scale the data
scaler = StandardScaler()
scaler = scaler.fit(df_for_training)
df_for_training_scaled = scaler.transform(df_for_training)
25 / 26
## preview first five rows
df_for_training.head(5)
## Open High Low Close Adj Close
## 0 101.290001 103.510002 101.059998 103.269997 102.884109
## 1 103.360001 105.129997 102.550003 104.699997 104.308762
## 2 104.459999 104.620003 102.839996 103.379997 102.993690
## 3 103.900002 106.150002 103.900002 106.089996 105.693565
## 4 106.330002 106.459999 104.800003 105.190002 104.796944
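
As a possible continuation of the example, a hedged sketch of turning the scaled data into overlapping sequences and fitting a small LSTM; the window length n_past = 14, the layer sizes, and the training settings are arbitrary illustrative choices, not part of the original data-prep code above.

## build overlapping input sequences: n_past days of features -> next day's Open
n_past, n_future = 14, 1
trainX, trainY = [], []
for i in range(n_past, len(df_for_training_scaled) - n_future + 1):
    trainX.append(df_for_training_scaled[i - n_past:i, :])
    trainY.append(df_for_training_scaled[i + n_future - 1, 0])
trainX, trainY = np.array(trainX), np.array(trainY)

## a small stacked-LSTM model (layer sizes and epochs are illustrative)
model = Sequential()
model.add(LSTM(64, return_sequences=True,
               input_shape=(trainX.shape[1], trainX.shape[2])))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(trainX, trainY, epochs=5, batch_size=16, validation_split=0.1)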
26 / 26
