class: center, middle, inverse, title-slide

.title[
# Long short-term memory (LSTM) networks
]
.author[
### Lin Yu
]
.date[
### 2022-08-29
]

---

# An example

Objective: investigate the relationship between drug **dosage** and **efficacy**

--

.pull-left[
![](figure/NN_example_f1.png)
]

--

.pull-right[
![](figure/NN_example_f2.png)
]

---
class: center, inverse, middle

# Neural Networks
## (Main idea)

---

![](figure/NN_example_f3.png)

**Terminology alert 🤯!!!**

.pull-left[
- Input layer
- Hidden layer(s)
- Output layer
]

.pull-right[
- Node/neuron
- Weights and biases
- Activation function
]

---

# Activation functions at a glance

![](figure/activation.png)

---

# Main idea of NN

## Flip, twist, and/or stretch the activation functions

![](figure/NN_example_f4.png)

---

# Update parameters

.pull-left[
That is, minimize the loss function:

- initialize weights from `\(N(0,1)\)`, set biases to zero
- forward propagation: calculate predicted values
- backpropagation and the chain rule

`\(SSR = \Sigma_{i=1}^n(Observed_i - Predicted_i)^2\)`

`\(Predicted \sim \sigma_1*w_3+\sigma_2*w_4\)`

`\(\sigma(z) = log(1+e^z)\)`

`\(z = w_1*x\)`

_derivative_ = `\(\frac{d_{SSR}}{d_{w_1}} = \frac{d_{SSR}}{d_{pred}} * \frac{d_{pred}}{d_{\sigma}} * \frac{d_{\sigma}}{d_{z}} * \frac{d_z}{d_{w_1}}\)`

`\(new\)` `\(w_1\)` = `\(old\)` `\(w_1\)` - `\(\alpha\)` * _derivative_
]

.pull-right[
![](figure/backpropagation.png)
![](figure/loss.png)
]

---

# The more common framework

![](figure/NN_example_f5.png)

---
class: center, middle, inverse

# However...

--

## traditional NNs
## only accept fixed-length input

--

## that is,
## they cannot be applied to sequential (e.g., audio/text) or time series data

---

# Scenarios

- **Example 1**: use an Amazon review to predict customer behavior (buying/not buying)

  Amazing, this box of cereal gave me a perfectly balanced breakfast, as all things should be. I only ate half of it but will definitely be buying again!

- **Example 2**: stock price prediction
  - company 1: `\(Day_1\)`, `\(Day_2\)`, `\(Day_3\)`, `\(Day_4\)`, ... `\(Day_n\)`
  - company 2: `\(Day_{1001}\)`, `\(Day_{1002}\)`, `\(Day_{1003}\)`, `\(Day_{1004}\)`, ... `\(Day_{100...}\)`

### We see:

- Inputs of different lengths
- Predictions that depend on historical information (correlation over time)

_So, traditional NNs do not work in these scenarios._

---
background-image: url(https://media3.giphy.com/media/10bKPDUM5H7m7u/giphy.gif)
background-position: 50% 50%
class: center, bottom, inverse

# Here comes the RNN to help!

---
class: inverse, middle, center

# Recurrent Neural Networks
## (Main idea)

---

# Feedback loops

### Information from earlier steps is taken in and passed on to the next cell

![](figure/RNN_1.png)

---

# An example

Given yesterday's and today's stock prices (i.e., time series data), denoted `\(x_{t-1}\)` and `\(x_{t}\)`, we want to predict tomorrow's stock price.

--

![](figure/RNN.png)

--

A little bit of formulation...

--

.pull-left[
`\(h_{t-1} = tanh(w_1*x_{t-1}+b_1)\)`
]

--

.pull-right[
`\(h_t = tanh(w_2*h_{t-1} + w_1*x_{t} + b_1)\)`
]

--

_btw,_ `\(h_t\)` _is called the hidden state_
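---

# The recurrence in code (sketch)

Purely for intuition, a minimal NumPy sketch of the recurrence above, using made-up scalar weights (`w1`, `w2`, `b1`) and three days of toy prices. In a trained RNN these shared weights would be learned by backpropagation, and a final layer would map the hidden state to the prediction.

```python
import numpy as np

# Made-up scalar parameters; a trained RNN would learn these.
w1, w2, b1 = 0.5, 1.2, 0.0

x = [1.0, 0.8, 1.1]   # toy (scaled) stock prices for three consecutive days

h = 0.0               # initial hidden state
for x_t in x:
    # The slide's recurrence: h_t = tanh(w2 * h_{t-1} + w1 * x_t + b1)
    h = np.tanh(w2 * h + w1 * x_t + b1)

print(h)              # the final hidden state would feed the prediction layer
```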
---

# Properties of RNNs

`\(h_{t-1} = tanh(w_1*x_{t-1}+b_1)\)`

`\(h_t = tanh(w_2*h_{t-1} + w_1*x_{t} + b_1)\)`

RNNs have the following properties:

- #### Weights and biases are shared across every input
- #### The exploding/vanishing gradient problem: the more we unroll an RNN, the harder it is to train
- #### Not capable of handling long-term dependencies; they only retain short-term memory (shown by [Hochreiter (1991)](https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf) and [Bengio, et al. (1994)](http://www-dsi.ing.unifi.it/~paolo/ps/tnn-94-gradient.pdf))

(Reminder: skip the next two slides if not interested)

---

# The exploding/vanishing gradient problem

Take the stock price prediction task and suppose we had 100 sequential days of stock price data.

The gradient/step size: `\(\frac{d_{SSR}}{d_{w_1}} = \frac{d_{SSR}}{d_{pred}} * \frac{d_{pred}}{d_{h}} * \frac{d_{h}}{d_{z}} * \frac{d_z}{d_{w_1}}\)`, where `\(z = w_1*x_{1} + b_1\)`

`\(h_{1} = tanh(w_1*x_{1}+b_1)\)`; `\(h_2 = tanh(w_2*h_{1} + w_1*x_{2} + b_1)\)`

`\(h_3 = tanh(w_2*h_{2} + w_1*x_{3} + b_1)\)`

simplified: `\(h_3 \sim f[w_2*f[w_2*f(w_1*x_1)]]\)`

...

`\(h_{100} \sim f[w_2*f[w_2*f[w_2*f[w_2*....f[w_2*f(w_1*x_1)]]]]] \sim ffff...f(w_2^{99}*w_1*x_1)\)`

continued on the next slide

---

The gradient: `\(\frac{d_{SSR}}{d_{w_1}} = \frac{d_{SSR}}{d_{pred}} * \frac{d_{pred}}{d_{h}} * \frac{d_{h}}{d_{z}} * \frac{d_z}{d_{w_1}}\)`, where `\(z = w_1*x_{1} + b_1\)`

`\(h_{100} \sim f[w_2*f[w_2*f[w_2*f[w_2*....f[w_2*f(w_1*x_1)]]]]] \sim ffff...f(w_2^{99}*w_1*x_1)\)`

Unrolling the chain rule through `\(h_{100}\)` picks up a factor of roughly `\(w_2^{99}\)`, so the gradient looks like: `\(\frac{d_{SSR}}{d_{w_1}}\)` = something * `\(w_2^{99}\)`

`\(new\)` `\(w\)` = `\(old\)` `\(w\)` - `\(\alpha\)` * _derivative_

If `\(w_2 > 1\)`, the gradient explodes; if `\(w_2 < 1\)`, the gradient vanishes.

![](figure/loss.png)

---
class: center, inverse, middle

# At long last, here it comes... LSTM💣

---

## The GATES

- A way to optionally let information through, i.e., to remove information from or add information to the cell state;
- Composition: a sigmoid layer + a point-wise multiplication operation;
- The sigmoid output ∈ (0,1), describing how much of each component should be let through.

![](figure/lstm_cell.png)

---

# Forget gate

### Decides what information will be removed/thrown away from the cell state

Previous example:

**Amazing**, this box of cereal gave me a **perfectly balanced breakfast**, as all things should be. I only ate half of it but will **definitely be buying again**!

![](figure/forget.PNG)

---

# Input gates

- part 1: the input layer gate decides which values to update
- part 2: the input modulation gate creates a new vector of candidate values

![](figure/input.PNG)

--

_Input modulation gate (g): it is often considered a sub-part of the input gate, and much of the literature on LSTMs does not even mention it, assuming it sits inside the input gate. It modulates the information that the input gate will write onto the internal cell state by **adding non-linearity to the information and making the information zero-mean**. This is done to **reduce the learning time, as zero-mean input has faster convergence**. Although this gate's actions are less important than the others and it is often treated as a finesse-providing concept, it is good practice to include it in the structure of the LSTM unit._

---

# Update the old cell state

- part 1: multiply the old cell state by the forget gate
- part 2: add the new candidate values, scaled by the input gate, to keep the updated information

![](figure/cellstate.png)

---

# Output gate

#### Output the cell state (with some filtering)

- part 1: a sigmoid layer decides what parts of the cell state we are going to output
- part 2: push the cell state through a tanh and multiply it by the output gate

![](figure/output.png)
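---

# The gates in code (sketch)

Purely for intuition, a minimal NumPy sketch of one LSTM cell step that ties the four gates together. The weights are random and untrained, the sizes are toy choices, and the helper names (`lstm_step`, `sigmoid`) are mine; Keras's `LSTM` layer performs this computation (plus training) for you.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step; each gate sees the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: which candidate values to write
    g = np.tanh(W["g"] @ z + b["g"])   # input modulation gate: candidate values
    c = f * c_prev + i * g             # update the old cell state
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: what part of the cell state to expose
    h = o * np.tanh(c)                 # new hidden state
    return h, c

# Toy sizes: 1-dimensional input, 2-dimensional hidden state, random untrained weights.
rng = np.random.default_rng(0)
n_in, n_h = 1, 2
W = {k: rng.normal(size=(n_h, n_h + n_in)) for k in "figo"}
b = {k: np.zeros(n_h) for k in "figo"}

h, c = np.zeros(n_h), np.zeros(n_h)
for price in [1.0, 0.8, 1.1]:          # three toy (scaled) daily prices
    h, c = lstm_step(np.array([price]), h, c, W, b)
```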
---
class: middle, center, inverse

# End of this presentation, thanks!

---

# An example with python code

- Import necessary libraries

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Dropout
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
import seaborn as sns
```

- Import data

```python
df = pd.read_csv("data/GE.csv")
train_dates = pd.to_datetime(df['Date'])   # parse the Date column for later use

cols = list(df)[1:6]                       # Open, High, Low, Close, Adj Close
df_for_training = df[cols].astype(float)

## scale the data
scaler = StandardScaler()
scaler = scaler.fit(df_for_training)
df_for_training_scaled = scaler.transform(df_for_training)
```

---

```python
## preview the first five rows
df_for_training.head(5)
```

```
##          Open        High         Low       Close   Adj Close
## 0  101.290001  103.510002  101.059998  103.269997  102.884109
## 1  103.360001  105.129997  102.550003  104.699997  104.308762
## 2  104.459999  104.620003  102.839996  103.379997  102.993690
## 3  103.900002  106.150002  103.900002  106.089996  105.693565
## 4  106.330002  106.459999  104.800003  105.190002  104.796944
```
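---

# Continuing the example (sketch)

The slides stop after scaling the data, so here is a hedged sketch of how `df_for_training_scaled` might feed the imported `LSTM`, `Dense`, and `Dropout` layers. The 14-day lookback window, the layer sizes, the dropout rate, and the choice of `Open` as the target are all illustrative assumptions, not choices from the original example.

```python
## build overlapping 14-day input windows, each predicting the next day's 'Open'
n_past = 14
trainX, trainY = [], []
for i in range(n_past, len(df_for_training_scaled)):
    trainX.append(df_for_training_scaled[i - n_past:i, :])
    trainY.append(df_for_training_scaled[i, 0])
trainX, trainY = np.array(trainX), np.array(trainY)

## a small stacked-LSTM model; sizes and dropout rate are illustrative only
model = Sequential()
model.add(LSTM(64, return_sequences=True,
               input_shape=(trainX.shape[1], trainX.shape[2])))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

history = model.fit(trainX, trainY, epochs=10, batch_size=16, validation_split=0.1)
```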